Introduction

Our research focuses on performing Exploratory Data Analysis (EDA) on Google Play Store apps to uncover patterns, trends, and insights regarding app characteristics, user behavior, and installation patterns. We examine how app popularity, defined as a high number of installs together with high review counts and ratings, is affected by category, last-updated date, app size, version, and other factors.

Smart Question

“What is the impact of content rating, required Android version, app category, size, and pricing on predicting app success in terms of positive ratings and high user reviews, as well as the number of installs, using data from Google Play Store apps from 2010 to 2018?”

Specific: The question clearly defines the variables (content rating, required Android version, app category, size, pricing) and the outcomes (positive ratings, high user reviews, number of installs).

Measurable: The outcomes (positive ratings, high user reviews, number of installs) are quantifiable.

Achievable: Given the availability of Google Play Store data from 2010 to 2018, the analysis is feasible.

Relevant: The question addresses a significant issue in the app development and marketing industry: predicting app success.

Time-specific: The timeframe (2010-2018) is clearly defined.

Data Preparation and Cleaning

Here, we load the ‘Google Play Store Apps’ dataset, stored in a CSV file, using read.csv().

#Loading the Dataset
data_apps <- data.frame(read.csv("googleplaystore.csv"))
#Checking the structure of the data
str(data_apps)
## 'data.frame':    10841 obs. of  13 variables:
##  $ App           : chr  "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "U Launcher Lite – FREE Live Cool Themes, Hide Apps" "Sketch - Draw & Paint" ...
##  $ Category      : chr  "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" ...
##  $ Rating        : num  4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
##  $ Reviews       : chr  "159" "967" "87510" "215644" ...
##  $ Size          : chr  "19M" "14M" "8.7M" "25M" ...
##  $ Installs      : chr  "10,000+" "500,000+" "5,000,000+" "50,000,000+" ...
##  $ Type          : chr  "Free" "Free" "Free" "Free" ...
##  $ Price         : chr  "0" "0" "0" "0" ...
##  $ Content.Rating: chr  "Everyone" "Everyone" "Everyone" "Teen" ...
##  $ Genres        : chr  "Art & Design" "Art & Design;Pretend Play" "Art & Design" "Art & Design" ...
##  $ Last.Updated  : chr  "January 7, 2018" "January 15, 2018" "August 1, 2018" "June 8, 2018" ...
##  $ Current.Ver   : chr  "1.0.0" "2.0.0" "1.2.4" "Varies with device" ...
##  $ Android.Ver   : chr  "4.0.3 and up" "4.0.3 and up" "4.0.3 and up" "4.2 and up" ...
#First 6 rows of the dataset
head(data_apps)
##                                                  App       Category Rating
## 1     Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN    4.1
## 2                                Coloring book moana ART_AND_DESIGN    3.9
## 3 U Launcher Lite – FREE Live Cool Themes, Hide Apps ART_AND_DESIGN    4.7
## 4                              Sketch - Draw & Paint ART_AND_DESIGN    4.5
## 5              Pixel Draw - Number Art Coloring Book ART_AND_DESIGN    4.3
## 6                         Paper flowers instructions ART_AND_DESIGN    4.4
##   Reviews Size    Installs Type Price Content.Rating                    Genres
## 1     159  19M     10,000+ Free     0       Everyone              Art & Design
## 2     967  14M    500,000+ Free     0       Everyone Art & Design;Pretend Play
## 3   87510 8.7M  5,000,000+ Free     0       Everyone              Art & Design
## 4  215644  25M 50,000,000+ Free     0           Teen              Art & Design
## 5     967 2.8M    100,000+ Free     0       Everyone   Art & Design;Creativity
## 6     167 5.6M     50,000+ Free     0       Everyone              Art & Design
##       Last.Updated        Current.Ver  Android.Ver
## 1  January 7, 2018              1.0.0 4.0.3 and up
## 2 January 15, 2018              2.0.0 4.0.3 and up
## 3   August 1, 2018              1.2.4 4.0.3 and up
## 4     June 8, 2018 Varies with device   4.2 and up
## 5    June 20, 2018                1.1   4.4 and up
## 6   March 26, 2017                1.0   2.3 and up

Description of the App Dataset Columns

  1. App: The name of the application, represented as a character string.
  2. Category: The main category of the app, such as “ART_AND_DESIGN,” represented as a character string.
  3. Rating: The average user rating of the app, recorded as a numeric value.
  4. Reviews: The total number of user reviews for the app, shown as a character string.
  5. Size: The size of the application, represented as a character string.
  6. Installs: The approximate number of installations for the app, stored as a character string.
  7. Type: Indicates whether the app is free or paid, represented as a character string.
  8. Price: The price of the app, stored as a character string. Free apps are listed as “0,” while paid apps have a dollar amount.
  9. Content.Rating: The target age group for the app, represented as a character string.
  10. Genres: The genre(s) of the app.
  11. Last.Updated: The date of the app’s last update, stored as a character string.
  12. Current.Ver: The current version of the app, represented as a character string.
  13. Android.Ver: The minimum Android version required to run the app, stored as a character string.

Apps

# Checking the type of the App 
typeof(data_apps$App)
## [1] "character"

Checking for and removing duplicated apps

#Count rows that share identical values in all other columns; groups with a count above 1 are duplicates
duplicate_apps <- aggregate(App ~ ., data = data_apps, FUN = length)  
duplicate_apps <- duplicate_apps[duplicate_apps$App > 1, ] 
duplicate_apps <- duplicate_apps[order(-duplicate_apps$App), ] 

#View(duplicate_apps)
#print(duplicate_apps)

print(paste("Number of duplicated Apps:",nrow(duplicate_apps)))
## [1] "Number of duplicated Apps: 404"
#Removing NA values and duplicates
data_clean <- data_apps[!is.na(data_apps$App), ] 
data_clean <- data_clean[!duplicated(data_clean$App), ] 

#(After removing the duplicates) Unique values
unique_apps <- length(unique(data_clean$App))
print(paste("Number of unique apps after removing the duplicates:", unique_apps))
## [1] "Number of unique apps after removing the duplicates: 9660"

Duplicate App Analysis:

  • 404 apps were repeated either twice or thrice.
  • After removing duplicates, the dataset now contains 9660 unique apps.
  • Total duplicates removed: 1181 apps.
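
As a minimal sketch of the deduplication logic above, consider a small made-up data frame (toy_apps and its values are hypothetical, not taken from the dataset). duplicated() flags every repeat of an App name after its first occurrence, so the number of removed rows is the total row count minus the number of unique names.

```r
# Hypothetical toy data: "A" appears three times, "B" twice, "C" once
toy_apps <- data.frame(
  App = c("A", "B", "A", "C", "B", "A"),
  Rating = c(4.1, 3.9, 4.1, 4.5, 3.9, 4.2),
  stringsAsFactors = FALSE
)

# duplicated() marks repeats only; the first occurrence of each name is kept
toy_clean <- toy_apps[!duplicated(toy_apps$App), ]

# Rows dropped as duplicates
rows_removed <- nrow(toy_apps) - nrow(toy_clean)

# Distinct names that occurred more than once
names_repeated <- sum(table(toy_apps$App) > 1)

print(rows_removed)
print(names_repeated)
```

Applied to the real data, the same arithmetic gives 10841 - 9660 = 1181 removed rows.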

After dropping duplicates

str(data_clean$App)
##  chr [1:9660] "Photo Editor & Candy Camera & Grid & ScrapBook" ...

Price

typeof(data_apps$Price)
## [1] "character"

Conversion of Price to numeric

Paid apps have a ‘$’ in front of the price. Check for it and remove it before conversion.

#To check if there is dollar symbol present 
#data_clean$Price[]
# Remove dollar symbols and convert to numeric
data_clean$Price <- as.numeric(gsub("\\$", "", data_clean$Price))
#Recheck for dollar symbol
#data_clean$Price[]

All the dollar symbols are removed successfully.
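
The check-then-convert step can be sketched on a few made-up price strings (toy_price is hypothetical; "Everyone" stands in for a malformed entry and is an assumption, not a value confirmed from the dataset). The sketch also shows one plausible way a single NA can appear after conversion: as.numeric() returns NA for text that is not a number.

```r
# Hypothetical price strings; the last one is deliberately malformed
toy_price <- c("0", "$4.99", "$399.99", "Everyone")

# Which entries carry a dollar sign before conversion?
has_dollar <- grepl("$", toy_price, fixed = TRUE)

# Strip the symbol, then convert; non-numeric text becomes NA (with a warning,
# suppressed here to keep the output clean)
toy_numeric <- suppressWarnings(
  as.numeric(gsub("$", "", toy_price, fixed = TRUE))
)

print(has_dollar)
print(toy_numeric)
```

Using fixed = TRUE avoids having to escape the dollar sign, which is otherwise a regular-expression anchor.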

# Summary statistics for price
summary(data_clean$Price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   1.099   0.000 400.000       1

From the summary above, there is one missing value (NA) in the Price column. Let’s handle it!

Checking for missing values in Price

missing_na <- is.na(data_clean$Price)    
missing_blank <- data_clean$Price == "" 

sum(missing_na)
## [1] 1
sum(missing_blank, na.rm = TRUE)
## [1] 0
# Remove row where Price is NA or blank
data_clean <- data_clean[!is.na(data_clean$Price) & data_clean$Price != "", ]

We removed one row (#10473), whose app does not have a category name, as it is not relevant to our analysis.

#Recheck for missing values
summary(data_clean$Price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   1.099   0.000 400.000
Missing values removed successfully. (Price)

Type

#Counting the apps in each level of the Type variable
table(data_clean$Type)
## 
## Free Paid 
## 8902  756

From the Price column, we can see that 8903 apps are free, but the Type column counts only 8902 as Free. So let’s check!

#Checking for Missing values
print(paste("Missing values:",sum(is.na(data_clean$Type))))
## [1] "Missing values: 0"
data_clean[is.na(data_clean$Type), ]
##  [1] App            Category       Rating         Reviews        Size          
##  [6] Installs       Type           Price          Content.Rating Genres        
## [11] Last.Updated   Current.Ver    Android.Ver   
## <0 rows> (or 0-length row.names)
# Replace NaN or missing values in the Type column with "Free"
data_clean$Type[is.na(data_clean$Type)] <- "Free"

One row (9150) has a missing value for Type. As its price is 0, we replaced it with “Free”.

Missing values handled successfully. (Type)
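
As a hedged variation on the step above, the fill value can be derived from Price rather than hard-coded, so a missing Type is set to "Free" only when the price is 0. The data frame toy below is hypothetical, and this is a sketch of the idea rather than the code used in the analysis.

```r
# Hypothetical toy data with two missing Type values
toy <- data.frame(
  Type = c("Free", "Paid", NA, NA),
  Price = c(0, 2.99, 0, 1.49),
  stringsAsFactors = FALSE
)

# Locate the missing entries
missing_type <- is.na(toy$Type)

# Fill from Price: 0 means "Free", anything else means "Paid"
toy$Type[missing_type] <- ifelse(toy$Price[missing_type] == 0, "Free", "Paid")

print(table(toy$Type))
```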

Size

# Checking the type of the Size 
typeof(data_apps$Size)
## [1] "character"

Replacing missing values with the mean (Size)

# Replace "Varies with device" in the Size column with NA
data_clean$Size[data_clean$Size == "Varies with device"] <- NA
# Drop rows whose Size contains a "+" (malformed entries)
data_clean <- data_clean[!grepl("\\+", data_clean$Size), ]
# Convert Size to megabytes: "k" values are scaled by 0.001, "M" values only lose the suffix
data_clean$Size <- ifelse(grepl("k", data_clean$Size),
                          as.numeric(gsub("k", "", data_clean$Size)) * 0.001,
                          as.numeric(gsub("M", "", data_clean$Size)))
# Calculate and display the mean size for each value of the 'Category' column
mean_size_by_type <- tapply(data_clean$Size, data_clean$Category,
                            mean, na.rm = TRUE)
print(mean_size_by_type)
##      ART_AND_DESIGN   AUTO_AND_VEHICLES              BEAUTY BOOKS_AND_REFERENCE 
##           12.370968           20.037147           13.795745           13.134701 
##            BUSINESS              COMICS       COMMUNICATION              DATING 
##           13.867194           13.794959           11.307430           15.661119 
##           EDUCATION       ENTERTAINMENT              EVENTS              FAMILY 
##           19.057101           23.043750           13.963754           27.187988 
##             FINANCE      FOOD_AND_DRINK                GAME  HEALTH_AND_FITNESS 
##           17.368127           20.494318           41.866609           20.669707 
##      HOUSE_AND_HOME  LIBRARIES_AND_DEMO           LIFESTYLE MAPS_AND_NAVIGATION 
##           15.970258           10.602883           14.844916           16.368121 
##             MEDICAL  NEWS_AND_MAGAZINES           PARENTING     PERSONALIZATION 
##           19.189399           12.470189           22.512963           11.224624 
##         PHOTOGRAPHY        PRODUCTIVITY            SHOPPING              SOCIAL 
##           15.666158           12.342505           15.491435           15.984090 
##              SPORTS               TOOLS    TRAVEL_AND_LOCAL       VIDEO_PLAYERS 
##           24.058361            8.782837           24.204410           15.792756 
##             WEATHER 
##           12.680036
# Replace NA values in the Size column with the rounded mean size of the corresponding category (vectorized, no explicit loop)
data_clean$Size <- ifelse(
  is.na(data_clean$Size),  # Check if Size is NA
  round(mean_size_by_type[data_clean$Category], 1),  # Replace with the mean size based on the Category
  data_clean$Size  # Keep the original size if it's not NA
)
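
The whole Size pipeline (unit conversion followed by category-mean imputation) can be traced on a small made-up example. The data frame toy, the column Size_MB, and the helper names below are hypothetical; the logic mirrors the steps above under the same assumption that 1 k equals 0.001 M.

```r
# Hypothetical toy data: two categories, one "Varies with device" entry
toy <- data.frame(
  Category = c("GAME", "GAME", "TOOLS", "TOOLS", "GAME"),
  Size = c("40M", "Varies with device", "512k", "8M", "20M"),
  stringsAsFactors = FALSE
)

# Mark the non-numeric placeholder as missing
toy$Size[toy$Size == "Varies with device"] <- NA

# Strip the unit suffix once, then scale kilobyte entries to megabytes
is_kb <- grepl("k", toy$Size)
size_num <- as.numeric(gsub("[kM]", "", toy$Size))
toy$Size_MB <- ifelse(is_kb, size_num * 0.001, size_num)

# Mean size per category, ignoring missing values
cat_mean <- tapply(toy$Size_MB, toy$Category, mean, na.rm = TRUE)

# Fill each missing size with its own category's mean, rounded to 1 decimal
missing <- is.na(toy$Size_MB)
toy$Size_MB[missing] <- round(cat_mean[toy$Category[missing]], 1)

print(toy)
```

Indexing cat_mean by the category name is what ties each missing row to the mean of its own category rather than to the overall mean.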

Installs

Remove the ‘+’ sign, remove the commas, and convert to numeric

#clean installations
clean_installs <- function(Installs) {
  Installs <- gsub("\\+", "", Installs)  
  Installs <- gsub(",", "", Installs)    
  return(as.numeric(Installs))           
}

data_clean$Installs <- sapply(data_clean$Installs, clean_installs)

nan_rows <- sapply(data_clean[, c("Size", "Installs")], function(x) any(is.nan(x)))

# Select the columns flagged above; neither Size nor Installs contains NaN, so no columns are returned
data_clean[,nan_rows]
## data frame with 0 columns and 9659 rows
library(DT)  # datatable() comes from the DT package
datatable(data_clean, options = list(scrollX = TRUE))
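
To illustrate what clean_installs does, here is a short sketch on made-up install strings (toy_installs is hypothetical; the function is repeated so the snippet is self-contained). Because gsub() and as.numeric() are already vectorized, the function can be called on the whole vector directly, without sapply().

```r
# Same cleaning steps as above: drop "+", drop ",", convert to numeric
clean_installs <- function(Installs) {
  Installs <- gsub("\\+", "", Installs)
  Installs <- gsub(",", "", Installs)
  as.numeric(Installs)
}

# Hypothetical install strings in the same format as the dataset
toy_installs <- c("10,000+", "5,000,000+", "0", "1,000,000,000+")

# Vectorized call: one pass over the whole vector
cleaned <- clean_installs(toy_installs)

# Print without scientific notation for readability
print(format(cleaned, scientific = FALSE, big.mark = ","))
```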

Display the unique values

library(dplyr)  # provides %>% and mutate()
# Replace missing Rating values with the mean rating
data_clean <- data_clean %>%
  mutate(Rating = ifelse(is.na(Rating), mean(Rating, na.rm = TRUE), Rating))

# Identify the unique values in the 'Installs' column
unique_values <- unique(data_clean$Installs)

# Display the unique values
print(unique_values)
##  [1] 1e+04 5e+05 5e+06 5e+07 1e+05 5e+04 1e+06 1e+07 5e+03 1e+08 1e+09 1e+03
## [13] 5e+08 5e+01 1e+02 5e+02 1e+01 1e+00 5e+00 0e+00
# Function to convert the installs to numeric
convert_to_numeric <- function(x) {
  # Format without scientific notation, strip non-digit characters, then convert
  as.numeric(gsub("[^0-9]", "", format(x, scientific = FALSE, trim = TRUE)))
}

# Sort unique values based on the custom numeric conversion
sorted_values <- unique_values[order(sapply(unique_values, convert_to_numeric))]

Rating and Reviews

# Checking the type of the Rating 
typeof(data_clean$Rating)
## [1] "double"
# Checking the type of the Reviews 
typeof(data_clean$Reviews)
## [1] "character"

Checking the format of Rating and Reviews

str(data_clean$Reviews)
##  chr [1:9659] "159" "967" "87510" "215644" "967" "167" "178" "36815" ...
str(data_clean$Rating)
##  num [1:9659] 4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...

As we can see, the Reviews column is stored as strings, so it should be converted to integers before analysis.
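
One cautious way to do that conversion is sketched below on made-up values (toy_reviews is hypothetical, and "3.0M" is an assumed example of a malformed entry). Only strings made entirely of digits are converted; anything else is left as NA and listed for inspection instead of being silently coerced.

```r
# Hypothetical review counts stored as strings; the last one is malformed
toy_reviews <- c("159", "967", "87510", "3.0M")

# TRUE only for strings that consist solely of digits
looks_integer <- grepl("^[0-9]+$", toy_reviews)

# Convert the clean entries; leave the rest as NA
reviews_int <- rep(NA_integer_, length(toy_reviews))
reviews_int[looks_integer] <- as.integer(toy_reviews[looks_integer])

print(reviews_int)

# Entries that need manual attention
print(toy_reviews[!looks_integer])
```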

Checking the unique values for reviews and rating

unique_values <- unique(data_clean$Reviews)
unique_values1 <- unique(data_clean$Rating)
# Display the unique values
print(unique_values)
##    [1] "159"      "967"      "87510"    "215644"   "167"      "178"     
##    [7] "36815"    "13791"    "121"      "13880"    "8788"     "44829"   
##   [13] "4326"     "1518"     "55"       "3632"     "27"       "194216"  
##   [19] "224399"   "450"      "654"      "7699"     "61"       "118"     
##   [25] "192"      "20260"    "203"      "136"      "223"      "1120"    
##   [31] "227"      "5035"     "1015"     "353"      "564"      "8145"    
##   [37] "36639"    "158"      "591"      "117"      "176"      "295221"  
##   [43] "2206"     "26"       "174531"   "1070"     "85"       "845"     
##   [49] "367"      "1598"     "284"      "17057"    "129"      "542"     
##   [55] "10479"    "805"      "1403"     "3971"     "534"      "7774"    
##   [61] "38846"    "2431"     "6090"     "295"      "190"      "40211"   
##   [67] "356"      "52530"    "116986"   "1379"     "271920"   "736"     
##   [73] "7021"     "197"      "737"      "3574"     "994"      "197136"  
##   [79] "142"      "15168"    "2155"     "138"      "5414"     "21777"   
##   [85] "348"      "250"      "13372"    "7880"     "3617"     "4806"    
##   [91] "65786"    "31433"    "5097"     "1754"     "2680"     "1288"    
##   [97] "18900"    "49790"    "1150"     "1739"     "32090"    "2225"    
##  [103] "4369"     "8572"     "964"      "42050"    "104"      "17934"   
##  [109] "601"      "36"       "187"      "182"      "30"       "134"     
##  [115] "74"       "113715"   "3595"     "9315"     "75"       "38"      
##  [121] "26834"    "119"      "2277"     "2280"     "184"      "9"       
##  [127] "77"       "35"       "364"      "18"       "473"      "66"      
##  [133] "3871"     "257"      "62"       "2914724"  "1857"     "4478"    
##  [139] "577550"   "814080"   "246315"   "454060"   "155446"   "418"     
##  [145] "22486"    "203130"   "1435"     "116507"   "1433233"  "90468"   
##  [151] "860"      "363934"   "87873"    "17506"    "1862"     "2084"    
##  [157] "47303"    "19080"    "161"      "85842"    "7831"     "91615"   
##  [163] "4620"     "21336"    "26875"    "1778"     "2709"     "64513"   
##  [169] "8342"     "527"      "1322"     "1680"     "2739"     "1065"    
##  [175] "233757"   "2"        "51269"    "30105"    "156"      "114"     
##  [181] "341157"   "16129"    "674730"   "1254730"  "85185"    "32584"   
##  [187] "217730"   "70991"    "1002861"  "16589"    "148945"   "4458"    
##  [193] "62272"    "8941"     "46353"    "1279184"  "88073"    "67000"   
##  [199] "159872"   "30847"    "188841"   "11622"    "95912"    "4034"    
##  [205] "45964"    "14955"    "6903"     "31614"    "23055"    "19023"   
##  [211] "207372"   "1225"     "380837"   "10600"    "74359"    "822"     
##  [217] "80805"    "2287"     "4162"     "14760"    "23243"    "8978"    
##  [223] "42492"    "286897"   "103755"   "46505"    "11442"    "10295"   
##  [229] "296"      "29313"    "51507"    "1802"     "1383"     "23175"   
##  [235] "5868"     "2111"     "5448"     "4159"     "20815"    "78662"   
##  [241] "7149"     "3079"     "5800"     "6989"     "16422"    "108741"  
##  [247] "624"      "1661"     "97702"    "308"      "5211"     "1058"    
##  [253] "78172"    "413"      "1013635"  "24005"    "57106"    "2249"    
##  [259] "516"      "834"      "1010"     "238970"   "302"      "438"     
##  [265] "73"       "39"       "144"      "2181"     "93965"    "1446"    
##  [271] "12088"    "314"      "25671"    "15194"    "22551"    "29839"   
##  [277] "279"      "564387"   "1330"     "1677"     "757"      "115"     
##  [283] "125"      "9952"     "18814"    "21"       "15"       "51981"   
##  [289] "3596"     "1006"     "5968"     "4895"     "56642847" "69119316"
##  [295] "125257"   "9642995"  "1429035"  "4604324"  "3419249"  "11334799"
##  [301] "158679"   "3075028"  "4187998"  "659395"   "4785892"  "66602"   
##  [307] "30209"    "36901"    "5149854"  "192948"   "99559"    "437674"  
##  [313] "13698"    "2473509"  "20769"    "36880"    "171031"   "63543"   
##  [319] "45487"    "615381"   "2451083"  "33053"    "5387333"  "3648120" 
##  [325] "136662"   "42370"    "781810"   "3128250"  "2083237"  "541389"  
##  [331] "46702"    "2939"     "13761"    "258556"   "40751"    "17712922"
##  [337] "25021"    "27187"    "122498"   "132014"   "83239"    "594728"  
##  [343] "10484169" "2876500"  "28238"    "335646"   "350154"   "349384"  
##  [349] "346982"   "244863"   "10790289" "330761"   "37320"    "12842860"
##  [355] "2546527"  "15880"    "2264916"  "42925"    "2511130"  "13100"   
##  [361] "27156"    "55098"    "1133501"  "12578"    "10965"    "18247"   
##  [367] "190613"   "125232"   "72065"    "27540"    "104990"   "177703"  
##  [373] "177263"   "237468"   "32254"    "483565"   "552441"   "60308"   
##  [379] "457283"   "93825"    "32283"    "15287"    "205739"   "14873"   
##  [385] "7820209"  "9498"     "88427"    "305218"   "183374"   "20901"   
##  [391] "122595"   "124346"   "837842"   "255"      "41420"    "44706"   
##  [397] "23707"    "29208"    "191032"   "1545"     "57"       "0"       
##  [403] "4"        "516801"   "285726"   "76646"    "2556"     "7779"    
##  [409] "61637"    "12632"    "313724"   "48845"    "305708"   "31320"   
##  [415] "172460"   "4195"     "11633"    "10212"    "37053"    "667"     
##  [421] "13202"    "28671"    "1157"     "212626"   "222888"   "2067"    
##  [427] "1643"     "105"      "3414"     "42194"    "11806"    "1999"    
##  [433] "22544"    "97684"    "2519"     "1146"     "13046"    "17268"   
##  [439] "8722"     "953"      "2593"     "5377"     "852"      "212"     
##  [445] "1972"     "35206"    "5164"     "1939"     "277"      "80"      
##  [451] "825"      "40035"    "1093"     "135418"   "1601"     "2212"    
##  [457] "57081"    "241"      "63986"    "7888"     "535"      "5084"    
##  [463] "2430"     "837"      "738"      "4631"     "4953"     "1439"    
##  [469] "337"      "51698"    "923"      "149"      "198"      "23170"   
##  [475] "13890"    "13440"    "143"      "1059"     "894"      "6191"    
##  [481] "15081"    "218"      "243950"   "236"      "5152"     "1576"    
##  [487] "6701"     "742"      "2506"     "182986"   "8661"     "8"       
##  [493] "59"       "28"       "185"      "6"        "110"      "3"       
##  [499] "5"        "84"       "20"       "776"      "1"        "14"      
##  [505] "24"       "23"       "11"       "101"      "120"      "41605"   
##  [511] "791"      "5323"     "478"      "69"       "53"       "6289924" 
##  [517] "181893"   "2544"     "85375"    "314299"   "9770"     "32346"   
##  [523] "4075"     "10611"    "9321"     "56065"    "14286"    "133136"  
##  [529] "2469"     "36212"    "390"      "1090"     "266948"   "342918"  
##  [535] "748"      "172640"   "1619"     "3168"     "29855"    "6736"    
##  [541] "7005"     "889"      "5741"     "27572"    "10852"    "9888"    
##  [547] "1929"     "1516"     "215301"   "423"      "254519"   "1107903" 
##  [553] "211856"   "99020"    "90481"    "32381"    "248912"   "248555"  
##  [559] "272145"   "56897"    "8599"     "41185"    "29980"    "304"     
##  [565] "358"      "11904"    "22251"    "22384"    "73404"    "12733"   
##  [571] "25183"    "52743"    "61749"    "55704"    "19277"    "13612"   
##  [577] "37862"    "18372"    "656"      "240"      "275"      "3692"    
##  [583] "2363"     "1769"     "430"      "756"      "3963"     "316"     
##  [589] "642"      "172505"   "69493"    "7973"     "5695"     "142632"  
##  [595] "55256"    "54798"    "4815"     "75112"    "33646"    "206527"  
##  [601] "9348"     "3816"     "47847"    "16195"    "206"      "28392"   
##  [607] "3241"     "835"      "2525"     "828"      "200058"   "1239"    
##  [613] "702"      "108613"   "148550"   "3847"     "84309"    "14206"   
##  [619] "14700"    "42828"    "40209"    "1405"     "256079"   "2078"    
##  [625] "16103"    "31085"    "3528"     "5456208"  "11656"    "28948"   
##  [631] "296771"   "470089"   "10939"    "98509"    "5241"     "22508"   
##  [637] "10291"    "7165362"  "4885646"  "141980"   "6979"     "46618"   
##  [643] "103078"   "17682"    "37000"    "175528"   "1828284"  "34923"   
##  [649] "684116"   "46916"    "407698"   "702975"   "32458"    "235496"  
##  [655] "11661"    "653008"   "23063"    "87384"    "411683"   "8918"    
##  [661] "501498"   "2133296"  "29690"    "130549"   "613059"   "1633682" 
##  [667] "2646"     "21867"    "32732"    "243747"   "2639"     "1511"    
##  [673] "44550"    "7813"     "1033"     "2442"     "15254"    "155234"  
##  [679] "310066"   "12216"    "388089"   "92058"    "88185"    "493"     
##  [685] "33387"    "123279"   "27424"    "60841"    "29706"    "288150"  
##  [691] "14807"    "319692"   "61201"    "22998"    "12398"    "35928"   
##  [697] "64448"    "22378"    "16372"    "58028"    "736864"   "1968"    
##  [703] "35279"    "17247"    "87723"    "18523"    "182103"   "197774"  
##  [709] "8674"     "58082"    "115033"   "801"      "8968"     "303"     
##  [715] "732"      "1856"     "50725"    "1575"     "6238"     "9941"    
##  [721] "23666"    "67554"    "38769"    "160164"   "3771"     "256664"  
##  [727] "787177"   "3782"     "40113"    "7074"     "2153"     "26089"   
##  [733] "20611"    "811"      "15558"    "573"      "37"       "8232"    
##  [739] "3089"     "3874"     "464"      "731"      "8800"     "99"      
##  [745] "456"      "3200"     "5839"     "663"      "16"       "13"      
##  [751] "46"       "1953"     "12"       "4298"     "49"       "100"     
##  [757] "399"      "7"        "124424"   "39041"    "52306"    "36718"   
##  [763] "42644"    "278082"   "6076"     "112656"   "335738"   "31906"   
##  [769] "20672"    "957973"   "130582"   "31218"    "167168"   "34428"   
##  [775] "15247"    "48445"    "35518"    "12185"    "36746"    "21996"   
##  [781] "138371"   "12073"    "111632"   "250706"   "706301"   "64959"   
##  [787] "659741"   "510392"   "7215"     "25508"    "60449"    "381788"  
##  [793] "10697"    "347838"   "31804"    "3856"     "199684"   "44545"   
##  [799] "1336246"  "57493"    "283"      "12304"    "8188"     "11919"   
##  [805] "45957"    "126431"   "21570"    "134564"   "16961"    "111254"  
##  [811] "7731"     "5928"     "15703"    "6148"     "861"      "8662"    
##  [817] "23130"    "69973"    "1311"     "26587"    "2417"     "1054"    
##  [823] "25166"    "129304"   "19870"    "161440"   "7514"     "46106"   
##  [829] "15141"    "714"      "42410"    "260547"   "4344"     "22808"   
##  [835] "42809"    "16808"    "157505"   "24647"    "1922"     "3334"    
##  [841] "10658"    "78361"    "2594"     "13868"    "135952"   "11066"   
##  [847] "34861"    "37580"    "281485"   "685"      "3780"     "15192"   
##  [853] "5950"     "15993"    "5905"     "14627"    "1098"     "2898"    
##  [859] "70782"    "11264"    "100997"   "3290"     "341090"   "87951"   
##  [865] "24729"    "78306"    "43313"    "1374549"  "208463"   "6998"    
##  [871] "145323"   "95"       "64784"    "32997"    "82"       "2707"    
##  [877] "129737"   "611136"   "6118"     "2473"     "109784"   "3320"    
##  [883] "68103"    "8412"     "10741"    "3803"     "155944"   "10159"   
##  [889] "28008"    "43614"    "455377"   "1398"     "1032935"  "32405"   
##  [895] "151080"   "22513"    "90042"    "58316"    "8509"     "19314"   
##  [901] "21314"    "30224"    "454"      "14952"    "1250"     "1726"    
##  [907] "14065"    "556"      "4925"     "6507"     "11707"    "1077"    
##  [913] "46539"    "9066"     "1962"     "22071"    "196"      "278"     
##  [919] "61881"    "2129"     "1268"     "91359"    "22015"    "131569"  
##  [925] "31986"    "22875"    "17071"    "90242"    "483960"   "511228"  
##  [931] "1920"     "40116"    "51517"    "7690"     "321134"   "3755"    
##  [937] "104504"   "333208"   "35218"    "116403"   "37517"    "292969"  
##  [943] "428156"   "1577"     "38098"    "31139"    "272337"   "220125"  
##  [949] "400592"   "20098"    "117925"   "548021"   "48276"    "471036"  
##  [955] "12705"    "706"      "465"      "644"      "144040"   "51227"   
##  [961] "357417"   "199"      "827597"   "9116"     "2071"     "50294"   
##  [967] "708674"   "1140"     "232153"   "14709"    "12029"    "1873516" 
##  [973] "2880"     "270267"   "559186"   "77777"    "8642"     "501144"  
##  [979] "1861"     "1203"     "299"      "115721"   "14810"    "183662"  
##  [985] "27393"    "10445"    "49479"    "4848"     "20812"    "328469"  
##  [991] "100406"   "205299"   "66791"    "399009"   "5420"     "130104"  
##  [997] "251534"   "28951"    "60096"    "106547"   "134195"   "249855"  
## [1003] "109756"   "38343"    "190247"   "75571"    "70769"    "2107"    
## [1009] "26540"    "1608"     "19074"    "7976"     "7586"     "2885"    
## [1015] "48226"    "1026"     "28945"    "11506"    "6826"     "111450"  
## [1021] "19543"    "233243"   "11689"    "77563"    "5499"     "48286"   
## [1027] "26652"    "71269"    "20301"    "93691"    "56145"    "20326"   
## [1033] "12955"    "2681"     "325738"   "4102"     "40296"    "4559407" 
## [1039] "570242"   "121838"   "62616"    "12858"    "34356"    "50679"   
## [1045] "16943"    "524299"   "267"      "623"      "117176"   "70416"   
## [1051] "15674"    "14402"    "141163"   "69395"    "27439"    "2490"    
## [1057] "24094"    "18539"    "3061"     "229210"   "20547"    "3405"    
## [1063] "217"      "7895"     "32606"    "1324"     "126017"   "14394"   
## [1069] "1812"     "13724"    "10253"    "4642"     "16570"    "20161"   
## [1075] "2894"     "5038"     "31665"    "13799"    "111462"   "57634"   
## [1081] "8576"     "417907"   "3167"     "27386"    "162243"   "65913"   
## [1087] "24977"    "6000"     "37711"    "175293"   "174"      "353799"  
## [1093] "2758"     "1437"     "7573"     "8481"     "10054"    "10117"   
## [1099] "39189"    "3522"     "71419"    "36857"    "39123"    "14653"   
## [1105] "23013"    "287"      "4435"     "43800"    "4281"     "7508"    
## [1111] "491"      "160"      "22584"    "4087"     "2496"     "103305"  
## [1117] "2669"     "10"       "7619"     "126"      "273"      "2248"    
## [1123] "809"      "3280"     "1478"     "2382"     "4450"     "515"     
## [1129] "4465"     "2427"     "6631"     "11200"    "6896"     "3834"    
## [1135] "81"       "2087"     "58"       "3014"     "487"      "67007"   
## [1141] "539"      "126862"   "48"       "1465"     "929"      "783"     
## [1147] "2907"     "434"      "54"       "411"      "237"      "2580"    
## [1153] "363"      "130272"   "91"       "130"      "25"       "7396"    
## [1159] "58055"    "1703"     "7750"     "12657"    "1919"     "60170"   
## [1165] "831"      "8671"     "31"       "20145"    "912"      "102"     
## [1171] "3945"     "2221"     "3781"     "1267"     "18968"    "47497"   
## [1177] "140995"   "51357"    "13565"    "39364"    "7287"     "161143"  
## [1183] "16168"    "116079"   "815893"   "985"      "4260"     "726074"  
## [1189] "3829"     "33572"    "6145"     "34327"    "7457"     "41941"   
## [1195] "82145"    "10944"    "665"      "2167"     "53652"    "18961"   
## [1201] "9412"     "9663"     "23164"    "3031"     "95557"    "7869"    
## [1207] "4212"     "17368"    "6554"     "33264"    "34782"    "6676"    
## [1213] "1067"     "1797"     "367505"   "20304"    "7376"     "49147"   
## [1219] "69177"    "3448"     "39724"    "3788"     "95736"    "1658"    
## [1225] "3309"     "987"      "5208"     "78298"    "6808"     "12452"   
## [1231] "360"      "16637"    "95904"    "3114"     "220"      "33"      
## [1237] "1533"     "28301"    "3937"     "21195"    "2042"     "13213"   
## [1243] "118034"   "9464"     "10097"    "28588"    "19621"    "10544"   
## [1249] "4427"     "50338"    "3346"     "4447388"  "27722264" "22426677"
## [1255] "254258"   "148897"   "369203"   "5234162"  "23133508" "8118609" 
## [1261] "10485308" "1497361"  "59800"    "2610526"  "4066989"  "3778921" 
## [1267] "6198563"  "10306"    "44891723" "1000417"  "17039"    "685981"  
## [1273] "10393"    "14198297" "592068"   "1732263"  "295241"   "1135631" 
## [1279] "5566669"  "1295557"  "270687"   "2157930"  "506275"   "4920817" 
## [1285] "23005"    "68057"    "1300490"  "8923587"  "4128732"  "42053"   
## [1291] "257724"   "990491"   "10216538" "7614130"  "760628"   "9881829" 
## [1297] "2123381"  "7671249"  "3197865"  "1351068"  "5418675"  "1889250" 
## [1303] "183846"   "230710"   "5465624"  "1534466"  "14891223" "18985"   
## [1309] "655067"   "1385093"  "2698348"  "1125017"  "9882639"  "74673"   
## [1315] "5387639"  "2750410"  "461137"   "946926"   "9305"     "360630"  
## [1321] "6074334"  "8118880"  "10424925" "37023"    "422244"   "98123"   
## [1327] "21262"    "118253"   "141529"   "70226"    "2251012"  "30253"   
## [1333] "15763"    "84911"    "46416"    "7196"     "48256"    "6427773" 
## [1339] "25825"    "55380"    "148177"   "3715656"  "1841061"  "101686"  
## [1345] "275447"   "174755"   "4578476"  "145353"   "531458"   "195558"  
## [1351] "29940"    "41975"    "216675"   "2311785"  "165888"   "541144"  
## [1357] "1107310"  "29168"    "189773"   "337752"   "102107"   "3093358" 
## [1363] "42079"    "240416"   "32506"    "70747"    "358817"   "15403"   
## [1369] "38957"    "214777"   "100609"   "1343866"  "168717"   "549720"  
## [1375] "18996"    "25094"    "93033"    "120592"   "187972"   "484981"  
## [1381] "2055"     "73539"    "59017"    "5829"     "18621"    "19922"   
## [1387] "21119"    "7412"     "18125"    "10795"    "13004"    "38207"   
## [1393] "9394"     "3883589"  "2719142"  "931595"   "1480189"  "2468063" 
## [1399] "309176"   "807338"   "446434"   "522466"   "584126"   "32551"   
## [1405] "90218"    "212524"   "2045554"  "745684"   "416540"   "16601"   
## [1411] "3057481"  "224514"   "26247"    "10055521" "21892"    "197540"  
## [1417] "29445"    "1083571"  "4230886"  "2119218"  "1327265"  "1242855" 
## [1423] "401425"   "10979062" "515657"   "955656"   "1468591"  "725897"  
## [1429] "549039"   "1559650"  "292164"   "520962"   "1381820"  "525517"  
## [1435] "696"      "194969"   "327599"   "3816799"  "105620"   "37139"   
## [1441] "147791"   "347883"   "343263"   "216849"   "354373"   "753043"  
## [1447] "43055"    "80678"    "153381"   "559"      "3073251"  "26649"   
## [1453] "2151039"  "306764"   "280098"   "26985"    "1125438"  "42145"   
## [1459] "171448"   "104303"   "47644"    "125647"   "4355"     "214878"  
## [1465] "811040"   "155186"   "34494"    "260651"   "4638163"  "234110"  
## [1471] "48615"    "14774"    "12753"    "33983"    "20267"    "5761"    
## [1477] "11618"    "12948"    "11436"    "2150"     "382"      "24936"   
## [1483] "1109"     "108795"   "1455"     "1024"     "1014822"  "86961"   
## [1489] "7320"     "269194"   "18616"    "11950"    "4289"     "11716"   
## [1495] "3323"     "36606"    "328619"   "46741"    "530854"   "7050"    
## [1501] "17753"    "520609"   "432"      "32029"    "4207"     "64"      
## [1507] "96"       "82471"    "496"      "29436"    "19230"    "11126"   
## [1513] "23671"    "9652"     "9626"     "29319"    "1791"     "9199"    
## [1519] "14014"    "110877"   "10366"    "530792"   "12137"    "6404"    
## [1525] "6356"     "169"      "15246"    "4076"     "106750"   "33785"   
## [1531] "58795"    "3235"     "47031"    "131"      "673203"   "2178"    
## [1537] "175625"   "8508"     "3484"     "379415"   "19245"    "24877"   
## [1543] "10088"    "3762"     "141363"   "472584"   "1329192"  "148295"  
## [1549] "41273"    "392596"   "514088"   "41867"    "23060"    "112080"  
## [1555] "15489"    "51895"    "623398"   "66661"    "10447"    "1574197" 
## [1561] "19170"    "169609"   "6188"     "1369"     "2952"     "9856"    
## [1567] "10753"    "154"      "288523"   "4522"     "3328"     "854"     
## [1573] "560"      "63186"    "23772"    "6007"     "2903"     "3063"    
## [1579] "3234"     "276"      "1595"     "879"      "68559"    "1123"    
## [1585] "566"      "97"       "214"      "248"      "2195"     "1615"    
## [1591] "359"      "38021"    "6190"     "13155"    "1160"     "59917"   
## [1597] "1042"     "253"      "3396"     "59729"    "133117"   "47213"   
## [1603] "95537"    "51838"    "36028"    "528745"   "44062"    "79667"   
## [1609] "20763"    "8126"     "50887"    "63160"    "28737"    "45579"   
## [1615] "19720"    "361970"   "159619"   "354384"   "2376564"  "129603"  
## [1621] "1135"     "578"      "63"       "216"      "171"      "45"      
## [1627] "717"      "2921"     "92"       "1361"     "395"      "79"      
## [1633] "51"       "576"      "168"      "163"      "319"      "133"     
## [1639] "17"       "726"      "492"      "41"       "625"      "59158"   
## [1645] "19473"    "73118"    "27524"    "102858"   "2094"     "33033"   
## [1651] "78825"    "18674"    "6266"     "22"       "53743"    "2657"    
## [1657] "4476"     "156410"   "2006"     "6099"     "23160"    "8348"    
## [1663] "7837"     "6185"     "1838"     "3707"     "315"      "87418"   
## [1669] "21189"    "1746"     "69126"    "453"      "10710"    "700"     
## [1675] "914"      "15875"    "503"      "488"      "26862"    "72167"   
## [1681] "6035"     "9945"     "4318"     "78"       "2218"     "47"      
## [1687] "23889"    "650"      "3498"     "3052"     "1747"     "1686"    
## [1693] "529"      "1388"     "572"      "2159"     "2951"     "15545"   
## [1699] "5521"     "2108"     "408"      "272"      "4852"     "90"      
## [1705] "4303"     "460"      "513"      "112"      "83"       "343"     
## [1711] "137"      "4107"     "44"       "124"      "330"      "531"     
## [1717] "56"       "123"      "122"      "3786"     "65"       "292"     
## [1723] "19"       "78158306" "66577313" "8606259"  "49173"    "2955326" 
## [1729] "22492"    "17014787" "4305441"  "441189"   "4831125"  "4919"    
## [1735] "13762"    "6086"     "70616"    "1200"     "76480"    "271445"  
## [1741] "225103"   "33177"    "54768"    "457197"   "25562"    "16404"   
## [1747] "2508"     "79658"    "374"      "1259075"  "4751"     "60562"   
## [1753] "22695"    "22098"    "231325"   "13223"    "479908"   "313633"  
## [1759] "540930"   "57146"    "1225339"  "19583"    "344921"   "79129"   
## [1765] "83488"    "3781770"  "315441"   "382120"   "412725"   "3806669" 
## [1771] "1259849"  "486824"   "1157004"  "423105"   "205803"   "285816"  
## [1777] "138026"   "8936"     "4253"     "14835"    "6388"     "309872"  
## [1783] "17955"    "51502"    "1064049"  "58341"    "637309"   "161610"  
## [1789] "19446"    "1520959"  "33249"    "1175794"  "852455"   "900064"  
## [1795] "3677"     "695613"   "207712"   "175722"   "2052"     "624557"  
## [1801] "522018"   "141613"   "6210998"  "591312"   "94294"    "608753"  
## [1807] "38961"    "5916606"  "441473"   "2788923"  "973270"   "909226"  
## [1813] "1573054"  "857923"   "35563"    "85858"    "106798"   "8820"    
## [1819] "37186"    "25714"    "44255"    "1084945"  "18252"    "42750"   
## [1825] "3656"     "216741"   "662287"   "30834"    "367290"   "2588"    
## [1831] "162655"   "3860225"  "125783"   "171584"   "244141"   "568273"  
## [1837] "142512"   "109124"   "135043"   "108592"   "1315242"  "48732"   
## [1843] "308234"   "178497"   "68406"    "6012719"  "45362"    "18364"   
## [1849] "15867"    "9189"     "23187"    "11798"    "9975"     "52896"   
## [1855] "7793"     "9701"     "31519"    "181990"   "213735"   "42871"   
## [1861] "315908"   "210208"   "1370749"  "79261"    "284725"   "67071"   
## [1867] "46153"    "101883"   "1659"     "57920"    "28523"    "24953"   
## [1873] "37253"    "33758"    "3420"     "7193"     "2278"     "5121"    
## [1879] "39735"    "1558"     "16966"    "41986"    "186116"   "13085"   
## [1885] "28560"    "34171"    "44588"    "4158"     "105773"   "279428"  
## [1891] "33583"    "94205"    "95520"    "181798"   "25719"    "4602"    
## [1897] "72596"    "110425"   "2375"     "35497"    "3878"     "44071"   
## [1903] "6380"     "5123"     "19232"    "98716"    "109500"   "21159"   
## [1909] "1320"     "50424"    "32398"    "10858556" "219745"   "38953"   
## [1915] "142634"   "259450"   "123029"   "914804"   "21841"    "5282578" 
## [1921] "10349"    "859"      "40289"    "3362"     "49680"    "3378"    
## [1927] "10525"    "3492"     "654419"   "1864"     "74476"    "221858"  
## [1933] "401820"   "3116"     "31985"    "4400"     "3337956"  "11677"   
## [1939] "3158047"  "125259"   "56114"    "15700"    "16523"    "26361"   
## [1945] "285788"   "4410"     "5855"     "47090"    "233039"   "28578"   
## [1951] "1159058"  "53421"    "1579287"  "116880"   "21730"    "62421"   
## [1957] "70189"    "847159"   "251951"   "240475"   "527247"   "55427"   
## [1963] "68070"    "2418135"  "329160"   "42677"    "1517369"  "811693"  
## [1969] "23440"    "3368649"  "29707"    "15098"    "35724"    "71898"   
## [1975] "4865093"  "106080"   "44941"    "129272"   "111066"   "49553"   
## [1981] "43296"    "130081"   "462152"   "140917"   "88860"    "49211"   
## [1987] "351254"   "157506"   "1871416"  "420973"   "34753"    "635846"  
## [1993] "78140"    "244371"   "12865"    "215343"   "6120977"  "753115"  
## [1999] "16320"    "852649"   "1494491"  "819774"   "33439"    "477831"  
## [2005] "126337"   "373606"   "15426"    "714340"   "1451000"  "1490732" 
## [2011] "2163282"  "1163232"  "751766"   "7594559"  "1028637"  "167652"  
## [2017] "1075277"  "7529865"  "93726"    "597068"   "823109"   "197295"  
## [2023] "21578"    "307453"   "32896"    "542561"   "462702"   "521138"  
## [2029] "283662"   "82882"    "459795"   "133825"   "911995"   "1733"    
## [2035] "342909"   "31908"    "56270"    "107724"   "101455"   "152780"  
## [2041] "21733"    "410384"   "277902"   "91031"    "112725"   "76346"   
## [2047] "63938"    "18678"    "39878"    "152867"   "63580"    "121003"  
## [2053] "135763"   "176448"   "1981"     "16016"    "9992"     "36490"   
## [2059] "122282"   "4011"     "25172"    "14123"    "2487"     "180938"  
## [2065] "950"      "232423"   "1664"     "1391"     "2486"     "188834"  
## [2071] "697"      "5510"     "4549"     "6106"     "288809"   "43611"   
## [2077] "78442"    "3017"     "30840"    "36255"    "752"      "926"     
## [2083] "26102"    "37167"    "2020"     "798"      "7543"     "1845"    
## [2089] "13098"    "1904"     "11085"    "3387"     "11151"    "29673"   
## [2095] "50017"    "63650"    "50179"    "9971"     "1660"     "361"     
## [2101] "5305"     "24082"    "990"      "32386"    "35394"    "28895"   
## [2107] "11549"    "34123"    "4057"     "5517"     "4272"     "666521"  
## [2113] "838765"   "2943"     "108318"   "68935"    "80900"    "75545"   
## [2119] "1605267"  "136626"   "219848"   "52029"    "49190"    "5150"    
## [2125] "64713"    "216388"   "481545"   "6012"     "260121"   "1830388" 
## [2131] "17878"    "26871"    "1162837"  "359403"   "9235155"  "33782"   
## [2137] "192641"   "2338655"  "751551"   "24781"    "32862"    "16101"   
## [2143] "76779"    "2129689"  "98585"    "459851"   "5572"     "16815"   
## [2149] "45562"    "285"      "256"      "21443"    "2750"     "7081"    
## [2155] "17882"    "421800"   "43935"    "171889"   "251037"   "15750"   
## [2161] "27560"    "150932"   "928"      "768833"   "26665"    "13275"   
## [2167] "48930"    "43054"    "15680"    "16980"    "5112"     "30447"   
## [2173] "165299"   "620534"   "599872"   "12564"    "2528"     "34"      
## [2179] "427"      "35560"    "397422"   "11182"    "16734"    "10035"   
## [2185] "20313"    "25740"    "42546"    "6925"     "24281"    "18039"   
## [2191] "22748"    "47780"    "7705"     "33256"    "14544"    "134895"  
## [2197] "48082"    "2419"     "17915"    "61776"    "10323"    "1609"    
## [2203] "30403"    "18622"    "8258"     "57573"    "17202"    "263525"  
## [2209] "14114"    "890"      "7153"     "149723"   "6762"     "120373"  
## [2215] "40225"    "42849"    "5960"     "18294"    "38655"    "8033493" 
## [2221] "5745093"  "18239"    "24199"    "37333"    "12759663" "33216"   
## [2227] "28860"    "76604"    "26189"    "739329"   "45838"    "25592"   
## [2233] "34126"    "2394"     "330468"   "136874"   "1236"     "7790693" 
## [2239] "537554"   "315585"   "12215"    "127223"   "9602"     "24151"   
## [2245] "45483"    "28250"    "1859115"  "74819"    "18513"    "14552"   
## [2251] "87055"    "17030"    "1280423"  "357"      "163997"   "5431"    
## [2257] "1420"     "118439"   "59973"    "32111"    "114788"   "70404"   
## [2263] "429580"   "876866"   "207706"   "4254879"  "111507"   "472904"  
## [2269] "115409"   "166367"   "618918"   "192661"   "54207"    "60571"   
## [2275] "1335799"  "148506"   "679912"   "152692"   "77311"    "48211"   
## [2281] "12388"    "85387"    "33509"    "342336"   "90831"    "12718"   
## [2287] "85659"    "16395"    "807"      "24265"    "1657"     "576454"  
## [2293] "88675"    "56848"    "19096"    "1116393"  "154578"   "40676"   
## [2299] "39833"    "273283"   "139480"   "801054"   "65597"    "1107320" 
## [2305] "4594198"  "94427"    "335115"   "33926"    "1028794"  "134203"  
## [2311] "15693"    "48979"    "410303"   "63712"    "10595"    "9496"    
## [2317] "428581"   "225544"   "3090727"  "474439"   "41137"    "75336"   
## [2323] "745245"   "4934130"  "137562"   "172990"   "68309"    "1121805" 
## [2329] "7146"     "6466641"  "49657"    "1724"     "139258"   "32794"   
## [2335] "102923"   "6702776"  "7583"     "24215"    "5073"     "15633"   
## [2341] "1141545"  "29485"    "142393"   "71688"    "114851"   "62209"   
## [2347] "202474"   "512102"   "298321"   "5783441"  "47393"    "55525"   
## [2353] "2267"     "266401"   "649568"   "6342"     "5413"     "1237135" 
## [2359] "4724"     "43960"    "3277209"  "229"      "6626"     "10796"   
## [2365] "273994"   "29203"    "18918"    "2056"     "440"      "66453"   
## [2371] "398"      "12089"    "2828"     "89342"    "495905"   "86743"   
## [2377] "30498"    "97890"    "25037"    "58617"    "62301"    "37237"   
## [2383] "1591"     "595120"   "100130"   "294701"   "63624"    "112977"  
## [2389] "58052"    "12180"    "118459"   "112479"   "69417"    "733838"  
## [2395] "157495"   "86481"    "77724"    "2390185"  "33074"    "35771"   
## [2401] "290241"   "69488"    "10401"    "28806"    "11343"    "66730"   
## [2407] "25807"    "251616"   "87300"    "181399"   "881"      "40704"   
## [2413] "624924"   "51145"    "647844"   "2591941"  "7435"     "115773"  
## [2419] "12008"    "2084126"  "536926"   "3016297"  "1188154"  "2731171" 
## [2425] "8226"     "228794"   "3252896"  "226456"   "1079491"  "9653"    
## [2431] "58675"    "1038306"  "287250"   "480643"   "27800"    "691474"  
## [2437] "2131"     "5383985"  "26559"    "13500"    "8550"     "1861310" 
## [2443] "23089"    "6949"     "858208"   "815981"   "102451"   "56403"   
## [2449] "115072"   "19302"    "1488396"  "6572"     "577059"   "1092367" 
## [2455] "618798"   "209696"   "1498393"  "15368"    "123412"   "185632"  
## [2461] "80847"    "2764964"  "41418"    "549973"   "6011"     "25370"   
## [2467] "74146"    "176873"   "404617"   "10270"    "155999"   "56713"   
## [2473] "117255"   "298843"   "72513"    "205191"   "73695"    "144879"  
## [2479] "49794"    "609182"   "979"      "53015"    "277794"   "67523"   
## [2485] "17415"    "496399"   "549900"   "1508137"  "244567"   "2401017" 
## [2491] "57033"    "12321"    "122424"   "21507"    "80581"    "37204"   
## [2497] "10643"    "226295"   "57904"    "5157"     "8985"     "16349"   
## [2503] "30291"    "25985"    "60840"    "133573"   "23393"    "198051"  
## [2509] "102594"   "80119"    "26919"    "267189"   "6850"     "18669"   
## [2515] "16420"    "16149"    "71432"    "282460"   "86"       "17941"   
## [2521] "76"       "3614"     "1413"     "3789"     "67"       "107"     
## [2527] "4976"     "3248"     "9293"     "62386"    "162"      "1002"    
## [2533] "498"      "34336"    "9073"     "7505"     "11501"    "2715"    
## [2539] "349"      "806"      "60"       "824"      "6668"     "1940"    
## [2545] "1025"     "5343"     "247"      "970"      "3182"     "1528"    
## [2551] "76795"    "1558437"  "159455"   "2053404"  "892"      "981995"  
## [2557] "11118"    "178934"   "11297"    "15370"    "18194"    "40606"   
## [2563] "133338"   "36900"    "1312037"  "4184"     "100994"   "20008"   
## [2569] "13426"    "18773"    "26941"    "135337"   "7623"     "1422858" 
## [2575] "21404"    "2143"     "189313"   "89868"    "309617"   "3478"    
## [2581] "24349"    "22154"    "17493"    "15966"    "2332"     "29344"   
## [2587] "67772"    "634"      "18425"    "25655305" "7557"     "59089"   
## [2593] "1551"     "12764"    "54807"    "259003"   "121916"   "400"     
## [2599] "3930"     "1032076"  "239242"   "193381"   "7624"     "19738"   
## [2605] "259605"   "18699"    "98819"    "53006"    "436921"   "351168"  
## [2611] "1615596"  "906384"   "5639"     "504823"   "25922"    "5555"    
## [2617] "921868"   "6449"     "15874"    "2093"     "26421"    "2689"    
## [2623] "36969"    "45744"    "714665"   "119202"   "6474426"  "6685"    
## [2629] "28835"    "13205"    "1261"     "1215"     "169965"   "38630"   
## [2635] "249919"   "158196"   "42624"    "26411"    "296781"   "30693"   
## [2641] "83558"    "6066"     "293080"   "54256"    "355921"   "37882"   
## [2647] "190888"   "948198"   "9548"     "1071"     "11908"    "185058"  
## [2653] "697212"   "978"      "877635"   "51684"    "175110"   "11667403"
## [2659] "357944"   "735"      "2543"     "18818"    "30287"    "31504"   
## [2665] "13950"    "21147"    "41490"    "2090"     "44274"    "7006"    
## [2671] "2959"     "40167"    "56807"    "1284017"  "4332"     "10382"   
## [2677] "413999"   "66978"    "66384"    "17703"    "32121"    "350"     
## [2683] "74425"    "131028"   "84957"    "3311"     "23158"    "1380"    
## [2689] "49259"    "185884"   "40975"    "27104"    "247992"   "63647"   
## [2695] "6105"     "30722"    "56524"    "24790"    "13169"    "61692"   
## [2701] "3684"     "18976"    "63020"    "62740"    "76677"    "78154"   
## [2707] "479594"   "201737"   "233305"   "80368"    "23292"    "23641"   
## [2713] "13217"    "17240"    "256680"   "62465"    "103074"   "7232629" 
## [2719] "15681"    "53481"    "104800"   "50459"    "43269"    "56443"   
## [2725] "4928420"  "1421884"  "3652"     "126282"   "14110"    "18710"   
## [2731] "22063"    "286454"   "15922"    "652"      "29768"    "102248"  
## [2737] "41225"    "162049"   "26014"    "53562"    "10562"    "21589"   
## [2743] "3682"     "2909"     "37937"    "15865"    "63920"    "38375"   
## [2749] "16657"    "301413"   "20605"    "16094"    "65590"    "32225"   
## [2755] "44348"    "23279"    "12572"    "70556"    "1240"     "56471"   
## [2761] "6939"     "10218"    "617477"   "1688"     "18857"    "11838"   
## [2767] "60838"    "886418"   "3720"     "9951"     "2398"     "213340"  
## [2773] "1805398"  "94989"    "10247"    "730"      "263454"   "1728557" 
## [2779] "39480"    "1092337"  "4257"     "1221896"  "152658"   "14766"   
## [2785] "1852384"  "353342"   "3053"     "960726"   "1703479"  "8389714" 
## [2791] "867920"   "798522"   "6760"     "13118"    "413609"   "1117212" 
## [2797] "305765"   "154668"   "98324"    "47698"    "133195"   "17069"   
## [2803] "145088"   "46253"    "111741"   "49275"    "29265"    "85763"   
## [2809] "2440695"  "137198"   "22191"    "5637451"  "31061"    "12495"   
## [2815] "512996"   "800"      "2349421"  "50"       "3039889"  "1305050" 
## [2821] "155"      "3042"     "14210"    "291"      "4487182"  "668"     
## [2827] "828489"   "10053186" "472247"   "4624"     "1397944"  "2890"    
## [2833] "27275"    "1531"     "873"      "901110"   "1218"     "88"      
## [2839] "557"      "2387"     "42916526" "686"      "8096"     "499"     
## [2845] "608"      "707"      "376"      "109"      "43"       "2751"    
## [2851] "452"      "4115"     "459"      "306"      "1971777"  "305347"  
## [2857] "408292"   "1736105"  "5091448"  "2588730"  "640974"   "3058687" 
## [2863] "4972230"  "951"      "1591129"  "14026"    "1013867"  "762706"  
## [2869] "2586261"  "91171"    "1076243"  "15301"    "10158"    "4147718" 
## [2875] "1168959"  "4000433"  "23682"    "338449"   "214265"   "187892"  
## [2881] "740"      "15443"    "94910"    "63773"    "1506783"  "1354"    
## [2887] "4082"     "564759"   "951413"   "17350"    "15209"    "16257"   
## [2893] "4660"     "1476"     "75951"    "10374"    "210534"   "9400"    
## [2899] "894435"   "1302"     "4551"     "12726"    "3213548"  "3943"    
## [2905] "1117"     "607"      "9433"     "13096"    "1671658"  "6495"    
## [2911] "20368"    "5427"     "76593"    "183"      "5559"     "127"     
## [2917] "4830407"  "64884"    "857215"   "2065"     "12683"    "1034"    
## [2923] "989344"   "4116"     "7715"     "4710"     "543"      "89"      
## [2929] "777"      "415"      "231"      "517"      "99290"    "1960"    
## [2935] "29544"    "1786"     "6181640"  "145"      "19816"    "4031"    
## [2941] "15439"    "4108"     "65914"    "621"      "3250"     "681"     
## [2947] "10426"    "4140"     "1752017"  "178723"   "2454"     "19758"   
## [2953] "3133"     "7453"     "4635"     "8137"     "1372013"  "3032"    
## [2959] "403911"   "249308"   "29864"    "250257"   "80987"    "896118"  
## [2965] "421000"   "90082"    "83875"    "1838090"  "307398"   "2176"    
## [2971] "56259"    "705805"   "339"      "61264"    "405824"   "797"     
## [2977] "81747"    "609"      "164"      "98"       "934"      "2451136" 
## [2983] "290"      "9140"     "1603"     "1760"     "1656808"  "1017408" 
## [2989] "1185"     "4046"     "1955"     "19047"    "9019"     "4444"    
## [2995] "6418"     "2351"     "42"       "11263"    "4304"     "369"     
## [3001] "264"      "703"      "696019"   "45224"    "576210"   "6263"    
## [3007] "1111915"  "67410"    "5677"     "2965"     "155276"   "11535"   
## [3013] "5525"     "4251"     "1185148"  "90415"    "2852"     "1734"    
## [3019] "222308"   "556659"   "1827212"  "953894"   "31538"    "37234"   
## [3025] "58820"    "758780"   "596"      "1202"     "718"      "901"     
## [3031] "167229"   "2803"     "202"      "189"      "234606"   "128"     
## [3037] "2538"     "156862"   "63680"    "19727"    "4719"     "32597"   
## [3043] "552"      "166033"   "960"      "133180"   "620"      "817"     
## [3049] "313"      "37789"    "3570"     "48929"    "89947"    "466"     
## [3055] "30630"    "7462"     "8600"     "29505"    "106"      "6187"    
## [3061] "659"      "3965"     "4656"     "205"      "1475"     "148826"  
## [3067] "354"      "1699"     "11393"    "401530"   "925"      "671"     
## [3073] "274"      "140"      "30443"    "22401"    "324"      "14832"   
## [3079] "2059"     "826"      "180697"   "589"      "428268"   "298041"  
## [3085] "29"       "230"      "2026"     "86956"    "1129"     "108002"  
## [3091] "213"      "147"      "3062845"  "1162"     "720"      "502"     
## [3097] "1486"     "6627"     "4383"     "680"      "24668"    "13788"   
## [3103] "26893"    "591411"   "2194"     "2012"     "32"       "657"     
## [3109] "4264"     "21107"    "3642"     "495971"   "697939"   "7357"    
## [3115] "944"      "5369"     "135"      "1852"     "6367"     "259"     
## [3121] "5682"     "7687"     "51068"    "2925"     "1655"     "1696"    
## [3127] "11244"    "16771865" "14224"    "5178"     "628"      "12435"   
## [3133] "972574"   "464900"   "15097"    "146913"   "22503"    "1503544" 
## [3139] "5785"     "334"      "16111"    "2789775"  "482630"   "69115"   
## [3145] "38606"    "3044"     "1820"     "10067"    "480"      "2300"    
## [3151] "53144"    "22775"    "370"      "41502"    "963"      "21592"   
## [3157] "103"      "138129"   "6454"     "17988"    "1771"     "8465"    
## [3163] "146"      "21943"    "1468"     "1088"     "29756"    "1057"    
## [3169] "10490"    "16600"    "67611"    "6601"     "233588"   "166886"  
## [3175] "93638"    "83977"    "139"      "784"      "331"      "655"     
## [3181] "3315"     "71"       "1178"     "2158"     "245"      "210"     
## [3187] "568"      "712"      "24517"    "468"      "5599"     "52"      
## [3193] "11404"    "10249"    "906"      "1011"     "4575"     "509"     
## [3199] "93"       "397147"   "109263"   "192677"   "16876"    "2113"    
## [3205] "6121"     "85578"    "165723"   "984451"   "3546"     "430643"  
## [3211] "9879473"  "4288"     "2399"     "582"      "15924"    "3283"    
## [3217] "4016834"  "903"      "87"       "594"      "2772"     "1997"    
## [3223] "309"      "2460"     "1744"     "1667"     "856"      "1704112" 
## [3229] "2142"     "2371543"  "2447"     "352097"   "37607"    "12121"   
## [3235] "101957"   "95080"    "1130966"  "124970"   "1546"     "1092106" 
## [3241] "40617"    "5754"     "54063"    "1166"     "37584"    "1925"    
## [3247] "168487"   "20418"    "58366"    "1216"     "4210"     "12147"   
## [3253] "10806"    "355837"   "22018"    "30515"    "4878"     "252"     
## [3259] "63197"    "32613"    "514"      "1714"     "24210"    "12736"   
## [3265] "586"      "44636"    "3432"     "10748"    "916"      "875"     
## [3271] "1616"     "18612"    "2160"     "5898"     "38517"    "5227"    
## [3277] "281448"   "1131937"  "721646"   "1499466"  "285814"   "351267"  
## [3283] "16936"    "74842"    "45871"    "5180480"  "427185"   "50771"   
## [3289] "7464996"  "4421"     "337242"   "387958"   "229329"   "97071"   
## [3295] "853495"   "5894"     "96028"    "295430"   "121612"   "367951"  
## [3301] "1987"     "650114"   "605"      "41444"    "405"      "3976"    
## [3307] "201"      "11258"    "11408"    "94661"    "71829"    "13604"   
## [3313] "1721"     "24198"    "24697"    "8537"     "180"      "29540"   
## [3319] "63699"    "1886"     "253115"   "55571"    "384602"   "227401"  
## [3325] "596628"   "69279"    "18921"    "1628"     "3908"     "2105"    
## [3331] "2901"     "2634605"  "44939"    "37224"    "829753"   "48253"   
## [3337] "111"      "2371"     "5103"     "630"      "3846378"  "533"     
## [3343] "5967"     "1218055"  "793"      "27501"    "18604"    "12906"   
## [3349] "224"      "40934"    "537"      "4786"     "42529"    "472"     
## [3355] "29854"    "3270"     "402"      "7816"     "409"      "362"     
## [3361] "3043"     "336"      "383"      "3263"     "1041"     "377"     
## [3367] "448"      "166"      "114340"   "66473"    "1450"     "1298"    
## [3373] "332"      "9514"     "152470"   "5107"     "9221"     "10369"   
## [3379] "1586"     "121113"   "14491"    "2976"     "1075"     "208"     
## [3385] "8769"     "3005"     "9914"     "1774"     "705"      "18751"   
## [3391] "25243"    "52677"    "8696"     "10672"    "999"      "204"     
## [3397] "6267"     "40241"    "37302"    "3252"     "317"      "457"     
## [3403] "1555"     "1828"     "1343"     "20178"    "20476"    "39495"   
## [3409] "2171"     "6973"     "8100"     "61746"    "17263"    "30002"   
## [3415] "5262"     "41759"    "4569"     "18277"    "721"      "1776"    
## [3421] "8638"     "1417"     "8581"     "74902"    "6988"     "804"     
## [3427] "263"      "238459"   "17876"    "4726"     "194"      "74531"   
## [3433] "165224"   "503757"   "79792"    "626366"   "38055"    "7479"    
## [3439] "26530"    "76484"    "1450632"  "61990"    "94761"    "52312"   
## [3445] "4931562"  "306652"   "23453"    "50893"    "2215"     "9013"    
## [3451] "6738"     "68"       "375"      "2931"     "536"      "1998"    
## [3457] "1689"     "45458"    "486"      "949"      "80927"    "116973"  
## [3463] "3597"     "25627"    "40"       "55313"    "41624"    "1891"    
## [3469] "65766"    "346681"   "1407"     "2975"     "1564"     "6750"    
## [3475] "3873"     "74744"    "107441"   "18893"    "26916"    "1031"    
## [3481] "345"      "585"      "734"      "165"      "807226"   "322"     
## [3487] "23971"    "145931"   "243121"   "33812"    "82827"    "803"     
## [3493] "218451"   "1291"     "57400"    "284670"   "445756"   "361780"  
## [3499] "41608"    "24456"    "2319"     "7878"     "6696"     "4832"    
## [3505] "173394"   "2654"     "96419"    "13620"    "12322"    "29551"   
## [3511] "47463"    "1340"     "55011"    "7116"     "759"      "497826"  
## [3517] "1777"     "518"      "1201"     "1133393"  "3547"     "1867"    
## [3523] "208501"   "4581"     "1313"     "1493"     "4537"     "5849"    
## [3529] "327"      "33944"    "10256"    "8004"     "636228"   "245839"  
## [3535] "68072"    "210317"   "1333338"  "899748"   "23729"    "406511"  
## [3541] "179139"   "104389"   "559931"   "130689"   "234971"   "228737"  
## [3547] "72202"    "2570"     "14145"    "21223"    "438911"   "2063"    
## [3553] "45771"    "7148"     "19209"    "29495"    "1721943"  "8219586" 
## [3559] "690148"   "579519"   "170973"   "62636"    "123136"   "250197"  
## [3565] "221722"   "29838"    "1012"     "183004"   "143087"   "652940"  
## [3571] "336386"   "5195"     "91935"    "92522"    "45370"    "41269"   
## [3577] "394842"   "1008012"  "1231"     "1443"     "148083"   "31596"   
## [3583] "598975"   "64164"    "16063"    "684"      "177"      "188"     
## [3589] "11018"    "676"      "70"       "242"      "3743"     "129409"  
## [3595] "22667"    "10114"    "48427"    "4704"     "265"      "41000"   
## [3601] "161637"   "349151"   "169369"   "218881"   "31134"    "242722"  
## [3607] "244039"   "720685"   "84389"    "137696"   "322976"   "254"     
## [3613] "898"      "22435"    "16801"    "391325"   "28735"    "580160"  
## [3619] "862"      "1226514"  "19070"    "20247"    "1363"     "1602"    
## [3625] "1017237"  "211"      "410"      "1745"     "469"      "25195"   
## [3631] "9636"     "153"      "1152"     "36268"    "20879"    "76340"   
## [3637] "2563"     "4374"     "23966"    "961"      "246201"   "125652"  
## [3643] "11379"    "2057"     "42329"    "344283"   "258717"   "40437"   
## [3649] "51787"    "531074"   "480208"   "29867"    "450013"   "228130"  
## [3655] "155693"   "81668"    "11773"    "870928"   "407788"   "118285"  
## [3661] "326232"   "318134"   "2445"     "65119"    "34898"    "889425"  
## [3667] "1041836"  "17945"    "47151"    "107765"   "37165"    "174127"  
## [3673] "4706"     "73919"    "532"      "344819"   "9335"     "100805"  
## [3679] "23168"    "2628"     "235486"   "2717"     "24123"    "7728"    
## [3685] "2180"     "59223"    "16162"    "1976"     "6698"     "21266"   
## [3691] "4041"     "2691"     "27856"    "9126"     "568922"   "85278"   
## [3697] "16521"    "91667"    "664"      "2433"     "1827"     "762"     
## [3703] "23609"    "24312"    "874"      "29462"    "159063"   "294"     
## [3709] "6207063"  "631"      "617"      "215"      "1275373"  "1434"    
## [3715] "701"      "2420"     "251"      "1170641"  "150"      "2598579" 
## [3721] "574719"   "78629"    "648380"   "7317"     "18325"    "725"     
## [3727] "7718"     "249"      "29062"    "484"      "1948"     "35572"   
## [3733] "20973"    "1604146"  "260137"   "2079"     "8346"     "7264"    
## [3739] "6205"     "19666"    "2808"     "24775"    "3845"     "244"     
## [3745] "15765"    "305"      "141"      "4546"     "924"      "569"     
## [3751] "15806"    "94"       "6344"     "10446"    "1213"     "649"     
## [3757] "72"       "489"      "2954"     "9895"     "162530"   "39779"   
## [3763] "254518"   "88901"    "65146"    "104551"   "66321"    "29270"   
## [3769] "751911"   "1520"     "11087"    "26426"    "2728"     "132"     
## [3775] "463"      "386"      "5300"     "226"      "1491"     "592"     
## [3781] "181"      "690"      "677"      "79826"    "3647"     "242096"  
## [3787] "3452530"  "1424"     "169661"   "172"      "7441"     "36151"   
## [3793] "113"      "8343"     "209"      "27135"    "6230"     "137377"  
## [3799] "7461"     "39109"    "5988"     "10341"    "193"      "289"     
## [3805] "540"      "4496"     "1238"     "474"      "4071"     "3347"    
## [3811] "983"      "1988"     "3491"     "4396"     "192851"   "8418"    
## [3817] "9443"     "787"      "838"      "2806"     "66033"    "54090"   
## [3823] "3103"     "48545"    "3451011"  "618562"   "4330"     "484226"  
## [3829] "28390"    "1580"     "556929"   "2520"     "559597"   "305367"  
## [3835] "128367"   "103909"   "76608"    "275843"   "263907"   "1849"    
## [3841] "3066"     "262"      "28892"    "634159"   "428"      "1542"    
## [3847] "17067"    "3322"     "2509"     "447"      "29798"    "412"     
## [3853] "407"      "1073"     "63056"    "643"      "81219"    "694"     
## [3859] "35188"    "31883"    "101762"   "28660"    "20101"    "10440"   
## [3865] "267378"   "2017"     "142693"   "2076"     "7118"     "34079"   
## [3871] "4334"     "1644"     "1630"     "10758"    "9612"     "5442"    
## [3877] "205830"   "3049"     "867"      "14453"    "6079"     "167406"  
## [3883] "128579"   "389"      "8175"     "8114"     "4027"     "17180"   
## [3889] "286"      "675"      "219"      "577"      "7420"     "5055"    
## [3895] "3640"     "1819"     "2312084"  "282"      "284795"   "5644"    
## [3901] "63765"    "1118201"  "899010"   "205914"   "1042170"  "42729"   
## [3907] "29944"    "212067"   "155694"   "344383"   "4099"     "4722"    
## [3913] "501"      "616"      "4010"     "325"      "35121"    "341"     
## [3919] "381023"   "26601"    "414"      "68664"    "524467"   "52199"   
## [3925] "6333"     "956"      "969"      "1469"     "2066"     "26744"   
## [3931] "802"      "41747"    "19221"    "2448"     "5793284"  "549"     
## [3937] "445"      "301"      "222"      "4160"     "2614"     "2683"    
## [3943] "3379"     "33783"    "28447"    "8419"     "4205"     "1563"    
## [3949] "355"      "1901"     "6073"     "7326"     "1318"     "855"     
## [3955] "320"      "9765"     "2710"     "622"      "1879"     "8827"    
## [3961] "11760"    "328"      "8185"     "661"      "2539"     "913"     
## [3967] "597"      "33661"    "46801"    "5591653"  "1432447"  "1167143" 
## [3973] "125578"   "14283"    "20675"    "51569"    "38297"    "159398"  
## [3979] "776730"   "152"      "70903"    "4451317"  "66740"    "186"     
## [3985] "7808"     "174423"   "5661"     "157"      "2256"     "2526"    
## [3991] "5004"     "191"      "67186"    "8259"     "3068"     "13466"   
## [3997] "27749"    "25515"    "28030"    "75719"    "1103"     "21804"   
## [4003] "20691"    "232"      "5865"     "544"      "2068"     "1902"    
## [4009] "1418"     "2736"     "1016"     "1304467"  "1167"     "245104"  
## [4015] "5879"     "138337"   "947515"   "5546"     "13304"    "4585"    
## [4021] "9966"     "70335"    "269"      "11480"    "1895"     "847"     
## [4027] "19640"    "333"      "25205"    "48451"    "15665"    "3703"    
## [4033] "2794"     "632"      "1307"     "641219"   "1926"     "437"     
## [4039] "20829"    "11187"    "2992"     "636"      "21979"    "37122"   
## [4045] "22167"    "238"      "2838064"  "4015"     "698"      "15057"   
## [4051] "10198"    "132792"   "4977"     "45558"    "19784"    "10006"   
## [4057] "246705"   "2719"     "547"      "329"      "2839"     "2048"    
## [4063] "1376"     "84114"    "13064"    "551"      "38448"    "22782"   
## [4069] "85882"    "51067"    "13005"    "396"      "41693"    "138739"  
## [4075] "91645"    "107497"   "20535"    "7664"     "25952"    "51791"   
## [4081] "25744"    "54221"    "2231"     "412744"   "12784"    "2586"    
## [4087] "992"      "24557"    "16041"    "267229"   "55723"    "14356"   
## [4093] "997"      "23474"    "12293"    "3588"     "767"      "1279"    
## [4099] "22290"    "34279"    "7342"     "5706"     "183343"   "5481"    
## [4105] "769"      "45452"    "112223"   "2100"     "12034"    "1977"    
## [4111] "62561"    "9016"     "41625"    "8433"     "421"      "555"     
## [4117] "19212"    "31621"    "688"      "7000"     "5463"     "17998"   
## [4123] "21785"    "1308"     "4538"     "522"      "201426"   "26138"   
## [4129] "51523"    "6477"     "1008"     "1692"     "1143"     "3433"    
## [4135] "4923"     "885187"   "113951"   "658087"   "4901"     "2132"    
## [4141] "175"      "1432809"  "293"      "51366"    "170"      "7314"    
## [4147] "116"      "2746"     "4379"     "3146"     "179"      "14153"   
## [4153] "21866"    "11514"    "9950"     "1045"     "51110"    "17861"   
## [4159] "1647"     "271"      "1947"     "4928"     "13232"    "1691"    
## [4165] "108"      "5509"     "1623"     "235"      "2599"     "52390"   
## [4171] "4218587"  "288835"   "221"      "29330"    "29990"    "348962"  
## [4177] "1995"     "2099"     "1022"     "4074"     "154519"   "207"     
## [4183] "12414"    "401643"   "9898"     "3371"     "39038"    "9403"    
## [4189] "17372"    "7885"     "1364"     "155649"   "562"      "206602"  
## [4195] "219821"   "1415"     "144545"   "349503"   "20001"    "2480"    
## [4201] "10774"    "5025"     "5618"     "2233681"  "385764"   "151095"  
## [4207] "488039"   "24900999" "23802"    "2246379"  "85496"    "262076"  
## [4213] "71740"    "3341"     "843"      "2889"     "112482"   "3989"    
## [4219] "8190074"  "85079"    "953790"   "10093"    "1500"     "1873"    
## [4225] "1147"     "1264"     "246"      "7519"     "2910"     "373"     
## [4231] "6271"     "24226"    "2359"     "1985"     "3863"     "20463"   
## [4237] "920571"   "2263"     "5395"     "30444"    "256916"   "37513"   
## [4243] "40847"    "96045"    "21433"    "82005"    "120035"   "6386"    
## [4249] "43314"    "24628"    "39698"    "121533"   "10776"    "13519"   
## [4255] "39153"    "51973"    "48754"    "7046"     "773"      "117461"  
## [4261] "108336"   "3904"     "14563"    "17786"    "709"      "81001"   
## [4267] "581"      "32416"    "8038"     "420"      "6947"     "26572"   
## [4273] "248417"   "102215"   "15911"    "9465"     "4607"     "92958"   
## [4279] "1018"     "7935"     "132282"   "2548"     "34443"    "36557"   
## [4285] "475020"   "59152"    "2871"     "19207"    "1044"     "28237"   
## [4291] "3247"     "119368"   "4552"     "108130"   "293086"   "8778"    
## [4297] "695"      "7063"     "37975"    "117850"   "167974"   "2630"    
## [4303] "4756"     "323"      "14754"    "7389"     "92010"    "68226"   
## [4309] "3446"     "297"      "3106"     "6078"     "561"      "77717"   
## [4315] "40328"    "1502622"  "22773"    "26358"    "52163"    "299046"  
## [4321] "41089"    "1019"     "2328"     "433"      "4946"     "16459"   
## [4327] "14823"    "8668"     "112565"   "14604"    "3253"     "431"     
## [4333] "3988"     "8191"     "693"      "347"      "853"      "6208"    
## [4339] "2374"     "3661"     "1992"     "922"      "4290"     "495"     
## [4345] "28136"    "12667"    "40907"    "475944"   "115176"   "121082"  
## [4351] "15830"    "3757"     "9555"     "770"      "2147"     "3972"    
## [4357] "2303"     "20921"    "27820"    "11310"    "715"      "593"     
## [4363] "416"      "6997"     "996"      "2555"     "165656"   "3935"    
## [4369] "12204"    "29229"    "31552"    "12639"    "1436"     "4513"    
## [4375] "6784"     "268"      "1137"     "3069"     "20843"    "137674"  
## [4381] "50428"    "1419"     "4069"     "439"      "818"      "6060"    
## [4387] "75140"    "387"      "16851"    "541732"   "47340"    "81502"   
## [4393] "208543"   "85015"    "2250"     "6200"     "11179"    "1514"    
## [4399] "470"      "876"      "32014"    "1189"     "2802"     "1740"    
## [4405] "1372"     "15753"    "1641"     "4228"     "9307"     "1151"    
## [4411] "12400"    "528550"   "35337"    "2674051"  "766"      "122010"  
## [4417] "136540"   "134412"   "327914"   "38824"    "171017"   "6289"    
## [4423] "49523"    "260527"   "419375"   "2464"     "93898"    "42190"   
## [4429] "420518"   "39895"    "794058"   "119685"   "88993"    "6514"    
## [4435] "954"      "355613"   "504765"   "2440877"  "6715"     "2557"    
## [4441] "937"      "139545"   "22333"    "45651"    "16190"    "304106"  
## [4447] "1186"     "258"      "264755"   "2354042"  "6895"     "1440"    
## [4453] "186648"   "226541"   "74497"    "264260"   "50109"    "54034"   
## [4459] "47386"    "401211"   "498894"   "9149"     "191621"   "32849"   
## [4465] "251686"   "252006"   "28694"    "785622"   "113183"   "951435"  
## [4471] "45610"    "257531"   "198480"   "375996"   "152102"   "17108"   
## [4477] "1764"     "3408"     "11100"    "1648"     "65448"    "16678"   
## [4483] "10318"    "53301"    "195"      "10786"    "407589"   "104068"  
## [4489] "562345"   "3941129"  "314774"   "1878"     "78142"    "326042"  
## [4495] "35172"    "28633"    "331692"   "3527"     "151"      "1130"    
## [4501] "6450"     "29387"    "38767"    "5623"     "157997"   "83545"   
## [4507] "34062"    "1484"     "152395"   "3715"     "975"      "396090"  
## [4513] "466495"   "41683"    "96658"    "227798"   "303394"   "39068"   
## [4519] "3909032"  "6026"     "41331"    "28107"    "217736"   "1648515" 
## [4525] "55952"    "928720"   "609186"   "771001"   "617732"   "332623"  
## [4531] "371318"   "216513"   "696665"   "860078"   "5997"     "796"     
## [4537] "2013"     "21095"    "429"      "1387"     "6752"     "708"     
## [4543] "19388"    "851"      "6505"     "569727"   "91186"    "1072565" 
## [4549] "120494"   "637"      "43677"    "79132"    "39682"    "18478"   
## [4555] "32879"    "34612"    "253207"   "23348"    "46242"    "40467"   
## [4561] "16192"    "148715"   "24565"    "59632"    "8780"     "38607"   
## [4567] "942"      "5985"     "426"      "505"      "794"      "905"     
## [4573] "820"      "11872"    "69013"    "364013"   "13079"    "4856"    
## [4579] "3745"     "2032"     "456474"   "267395"   "45359"    "25427"   
## [4585] "14432"    "54520"    "253155"   "154108"   "72161"    "43088"   
## [4591] "6320"     "271214"   "14089"    "26452"    "6120"     "7801"    
## [4597] "57449"    "7566"     "4649"     "10484"    "2537"     "4441"    
## [4603] "86172"    "7969"     "56664"    "2295"     "1290"     "15618"   
## [4609] "11402"    "1007"     "8193"     "1115"     "1853"     "1283"    
## [4615] "4813"     "3003"     "666"      "12111"    "8432"     "7812"    
## [4621] "9659"     "2576"     "3358"     "1911"     "28140"    "5485"    
## [4627] "11250"    "5093"     "8450"     "13492"    "2362"     "139432"  
## [4633] "1638"     "7896"     "58575"    "32881"    "441"      "475369"  
## [4639] "358633"   "1094094"  "23347"    "1626"     "36578"    "14253"   
## [4645] "15829"    "101738"   "372553"   "6716"     "3345"     "200450"  
## [4651] "42182"    "2700"     "2310"     "594406"   "85317"    "1013"    
## [4657] "2311"     "97209"    "4518"     "3580"     "11748"    "1205"    
## [4663] "138872"   "1091"     "11788"    "23022"    "3725"     "1357"    
## [4669] "19291"    "87766"    "14002"    "22896"    "107778"   "10676"   
## [4675] "2596"     "71852"    "4908"     "7335"     "12443"    "2046"    
## [4681] "3227"     "368"      "2962"     "768"      "21423"    "22382"   
## [4687] "877576"   "22032"    "162933"   "477"      "190086"   "63779"   
## [4693] "1312936"  "783025"   "4328"     "137338"   "199739"   "6231"    
## [4699] "24985"    "7578"     "13479633" "436615"   "465831"   "67707"   
## [4705] "4847"     "21762"    "20941"    "93930"    "5174"     "935"     
## [4711] "19234"    "7605"     "73821"    "12993"    "868"      "14051"   
## [4717] "4595"     "3390"     "500"      "23599"    "101163"   "321"     
## [4723] "128808"   "1048766"  "1251479"  "5692"     "6827"     "1522"    
## [4729] "17671"    "902"      "21186"    "18298"    "298"      "3375"    
## [4735] "6697"     "6156"     "5964"     "3195"     "281"      "15068"   
## [4741] "449"      "1894"     "1763"     "120852"   "2807"     "2318"    
## [4747] "2058"     "10355"    "3606"     "1060"     "417"      "580"     
## [4753] "2371338"  "1690802"  "3344300"  "1455952"  "282727"   "636995"  
## [4759] "567632"   "1461698"  "69574"    "32344"    "199808"   "203101"  
## [4765] "56444"    "201631"   "32600"    "93870"    "222664"   "166251"  
## [4771] "70389"    "172281"   "567984"   "820577"   "32496"    "2903386" 
## [4777] "85484"    "4490"     "38419"    "42767"    "140658"   "33178"   
## [4783] "585564"   "1172"     "2915"     "318142"   "5341"     "1111"    
## [4789] "627"      "3429"     "1730"     "30350"    "55408"    "633"     
## [4795] "1611"     "3776"     "4114"     "403"      "2210"     "1242"    
## [4801] "435"      "5969"     "311"      "22165"    "883"      "6547"    
## [4807] "42497"    "24278"    "7993"     "1309728"  "3946"     "17453"   
## [4813] "14773"    "2282"     "8369"     "1506"     "30008"    "11002"   
## [4819] "47576"    "2533"     "1880"     "7379"     "20977"    "13388"   
## [4825] "35746"    "137167"   "7203"     "267636"   "1294"     "234"     
## [4831] "16073"    "64815"    "5075"     "125616"   "3593"     "32522"   
## [4837] "1036"     "27130"    "157264"   "157322"   "76627"    "5290"    
## [4843] "1094"     "326689"   "6969"     "138050"   "105766"   "80313"   
## [4849] "4231"     "42515"    "20620"    "4600"     "9051"     "14692"   
## [4855] "40678"    "59096"    "127810"   "27179"    "351607"   "81543"   
## [4861] "31970"    "366"      "13731"    "104583"   "58553"    "6669"    
## [4867] "764967"   "5339"     "6735"     "23104"    "13258"    "48731"   
## [4873] "8482"     "3356"     "13714"    "35989"    "16237"    "21149"   
## [4879] "20292"    "140883"   "16426"    "32812"    "8091"     "352"     
## [4885] "3048"     "3136"     "467"      "34417"    "5482"     "1961"    
## [4891] "697805"   "182363"   "192374"   "4871"     "4234"     "49210"   
## [4897] "6667"     "31705"    "94308"    "848"      "751"      "595"     
## [4903] "6133"     "840"      "8011"     "2436"     "20421"    "5933"    
## [4909] "36968"    "3654"     "23302"    "1914"     "23805"    "526595"  
## [4915] "279917"   "4254"     "161423"   "2536"     "56197"    "80904"   
## [4921] "207440"   "93608"    "244797"   "7904"     "3175"     "1123190" 
## [4927] "49381"    "2102"     "7972"     "4416"     "3543"     "190274"  
## [4933] "2850"     "12846"    "2714"     "6143"     "32831"    "2941"    
## [4939] "197979"   "2208"     "1986068"  "71468"    "910051"   "6673"    
## [4945] "114479"   "1489"     "8608"     "14989"    "7107"     "127831"  
## [4951] "211308"   "382100"   "1317"     "200"      "291941"   "187200"  
## [4957] "246538"   "177542"   "26202"    "750321"   "6577"     "711719"  
## [4963] "47069"    "922752"   "7243"     "57076"    "7881"     "44233"   
## [4969] "291901"   "121304"   "5118"     "24175"    "135739"   "162831"  
## [4975] "10773"    "61600"    "75566"    "384368"   "71328"    "15105"   
## [4981] "34514"    "151374"   "2062"     "50060"    "188740"   "43852"   
## [4987] "13330"    "46369"    "29155"    "2200"     "5449"     "5731"    
## [4993] "11023"    "5291"     "39661"    "8014"     "141515"   "254861"  
## [4999] "19601"    "1938"     "519"      "1652"     "39647"    "12781"   
## [5005] "980"      "3482"     "2801"     "173"      "2936"     "2510"    
## [5011] "4798"     "1909"     "175509"   "10449"    "230564"   "81614"   
## [5017] "171771"   "5867"     "32207"    "23859"    "14428"    "4737"    
## [5023] "7851"     "20865"    "77609"    "11449"    "940"      "3509"    
## [5029] "380"      "669901"   "7252"     "337532"   "584070"   "1553"    
## [5035] "1303"     "93708"    "24980"    "1877"     "932870"   "24137"   
## [5041] "1536512"  "398746"   "70883"    "163679"   "1719"     "201718"  
## [5047] "49971"    "1260143"  "522205"   "174215"   "112384"   "108169"  
## [5053] "1260903"  "95201"    "223941"   "444"      "442"      "1865"    
## [5059] "66894"    "214923"   "606"      "43045"    "4501"     "891"     
## [5065] "20807"    "2649"     "2558"     "6747"     "5886"     "17543"   
## [5071] "36813"    "9389"     "59660"    "2338"     "2822"     "539931"  
## [5077] "2223"     "1032"     "2233"     "13752"    "903392"   "9513"    
## [5083] "27557"    "127229"   "3071"     "728"      "2000"     "256219"  
## [5089] "900"      "24091"    "70105"    "13253"    "84779"    "83891"   
## [5095] "25489"    "2294"     "568391"   "77585"    "9699"     "28728"   
## [5101] "111634"   "2645"     "8723"     "2115"     "2695"     "28151"   
## [5107] "1092"     "18280"    "11510"    "3258"     "1550"     "8894"    
## [5113] "70351"    "60298"    "171220"   "5474"     "1327"     "499483"  
## [5119] "239"      "19877"    "58387"    "54082"    "497"      "7443"    
## [5125] "3674"     "43191"    "936"      "32112"    "6801"     "2187"    
## [5131] "280"      "1276"     "1461"     "626"      "18926"    "3344"    
## [5137] "789"      "3696"     "7300"     "6849"     "260"      "2412"    
## [5143] "3187"     "1212"     "9716"     "201537"   "11051"    "38826"   
## [5149] "264282"   "15070"    "100179"   "318867"   "172373"   "25438"   
## [5155] "283823"   "3895"     "9562"     "42621"    "710"      "10743"   
## [5161] "76498"    "37090"    "6396"     "5285"     "68025"    "3047"    
## [5167] "9606"     "3840"     "1905"     "15221"    "1749"     "5229"    
## [5173] "342"      "670"      "37140"    "761"      "974"      "795"     
## [5179] "2019"     "758590"   "15883"    "401"      "91397"    "1526"    
## [5185] "5675"     "5015"     "41074"    "2317"     "981"      "729"     
## [5191] "4354"     "1463"     "13819"    "2378"     "3704"     "1539"    
## [5197] "20755"    "407694"   "38487"    "8649"     "43645"    "44027"   
## [5203] "83671"    "85410"    "839206"   "20364"    "1228"     "26347"   
## [5209] "71476"    "995002"   "29415"    "79464"    "72522"    "2980"    
## [5215] "28429"    "60097"    "3502"     "9296"     "12257"    "12919"   
## [5221] "8122"     "221691"   "320334"   "73185"    "18584"    "687136"  
## [5227] "1063"     "2027"     "26545"    "70753"    "21735"    "2917"    
## [5233] "83427"    "615"      "424"      "15036"    "4838"     "481"     
## [5239] "37277"    "13469"    "2390"     "479"      "3673"     "691"     
## [5245] "1711"     "346"      "2372"     "42432"    "30668"    "26224"   
## [5251] "105954"   "38473"    "452589"   "2183"     "156322"   "35171"   
## [5257] "2505"     "660613"   "22570"    "16282"    "103199"   "33788"   
## [5263] "1517"     "129542"   "12700"    "301895"   "1211"     "25275"   
## [5269] "1456"     "47688"    "871"      "1481"     "148"      "1338"    
## [5275] "340"      "763"      "1566"     "14221"    "1351833"  "3023"    
## [5281] "11460"    "88941"    "11235"    "2207"     "85468"    "36183"   
## [5287] "55014"    "275048"   "1916"     "1519671"  "153176"   "7279"    
## [5293] "61392"    "3471"     "68358"    "162564"   "9183"     "111809"  
## [5299] "26252"    "271908"   "332083"   "121321"   "3268"     "9894"    
## [5305] "316378"   "8484"     "2531"     "422"      "404"      "939"     
## [5311] "834117"   "245455"   "455"      "1035"     "1287"     "21661"   
## [5317] "28510"    "7339"     "61445"    "32433"    "2036"     "56496"   
## [5323] "376223"   "785"      "5775"     "885"      "88486"    "603"     
## [5329] "1195"     "398307"
print(unique_values1)
##  [1] 4.100000 3.900000 4.700000 4.500000 4.300000 4.400000 3.800000 4.200000
##  [9] 4.600000 3.200000 4.000000 4.173243 4.800000 4.900000 3.600000 3.700000
## [17] 3.300000 3.400000 3.500000 3.100000 5.000000 2.600000 3.000000 1.900000
## [25] 2.500000 2.800000 2.700000 1.000000 2.900000 2.300000 2.200000 1.700000
## [33] 2.000000 1.800000 2.400000 1.600000 2.100000 1.400000 1.500000 1.200000

Convert the Reviews column from character to numeric

## 'data.frame':    9659 obs. of  13 variables:
##  $ App           : chr  "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "U Launcher Lite – FREE Live Cool Themes, Hide Apps" "Sketch - Draw & Paint" ...
##  $ Category      : chr  "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" ...
##  $ Rating        : num  4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
##  $ Reviews       : num  159 967 87510 215644 967 ...
##  $ Size          : num  19 14 8.7 25 2.8 5.6 19 29 33 3.1 ...
##  $ Installs      : num  1e+04 5e+05 5e+06 5e+07 1e+05 5e+04 5e+04 1e+06 1e+06 1e+04 ...
##  $ Type          : chr  "Free" "Free" "Free" "Free" ...
##  $ Price         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Content.Rating: chr  "Everyone" "Everyone" "Everyone" "Teen" ...
##  $ Genres        : chr  "Art & Design" "Art & Design;Pretend Play" "Art & Design" "Art & Design" ...
##  $ Last.Updated  : chr  "January 7, 2018" "January 15, 2018" "August 1, 2018" "June 8, 2018" ...
##  $ Current.Ver   : chr  "1.0.0" "2.0.0" "1.2.4" "Varies with device" ...
##  $ Android.Ver   : chr  "4.0.3 and up" "4.0.3 and up" "4.0.3 and up" "4.2 and up" ...
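
The conversion code itself is not shown in this part of the report. As a minimal sketch of how character columns such as Reviews, Size, Installs and Price could be turned into numbers, the example below uses two hypothetical helpers (`to_number` and `size_to_mb`) on a small made-up data frame. The treatment of sizes ending in "k" as kilobytes divided by 1000 is an assumption for illustration, not necessarily the method used in this analysis.

```r
# Hypothetical helper: keep only digits and the decimal point, then convert.
# Values without any digits (e.g. "Varies with device") become NA.
to_number <- function(x) {
  cleaned <- gsub("[^0-9.]", "", x)
  cleaned[cleaned == ""] <- NA_character_
  as.numeric(cleaned)
}

# Hypothetical helper: express sizes such as "19M" or "8.5k" in megabytes.
# Assumption: a trailing "k" means kilobytes, converted with a factor of 1000.
size_to_mb <- function(x) {
  value <- to_number(x)
  is_kb <- grepl("k$", x, ignore.case = TRUE)
  ifelse(is_kb, value / 1000, value)
}

# Made-up rows that mimic the raw formats seen in the dataset
toy <- data.frame(
  Reviews  = c("159", "967", "87510"),
  Size     = c("19M", "8.5k", "Varies with device"),
  Installs = c("10,000+", "500,000+", "5,000,000+"),
  Price    = c("0", "$4.99", "0"),
  stringsAsFactors = FALSE
)

toy$Reviews  <- to_number(toy$Reviews)
toy$Size     <- size_to_mb(toy$Size)
toy$Installs <- to_number(toy$Installs)
toy$Price    <- to_number(toy$Price)

str(toy)
```

Stripping non-numeric characters before calling `as.numeric()` avoids coercion warnings, and leaves entries with no digits as NA so they can be handled explicitly later.
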
Table: Statistics summary.
App Category Rating Reviews Size Installs Type Price Content.Rating Genres Last.Updated Current.Ver Android.Ver
Min Length:9659 Length:9659 Min. :1.000 Min. : 0 Min. : 0.0085 Min. :0.000e+00 Length:9659 Min. : 0.000 Length:9659 Length:9659 Length:9659 Length:9659 Length:9659
Q1 Class :character Class :character 1st Qu.:4.000 1st Qu.: 25 1st Qu.: 5.3000 1st Qu.:1.000e+03 Class :character 1st Qu.: 0.000 Class :character Class :character Class :character Class :character Class :character
Median Mode :character Mode :character Median :4.200 Median : 967 Median : 13.1000 Median :1.000e+05 Mode :character Median : 0.000 Mode :character Mode :character Mode :character Mode :character Mode :character
Mean NA NA Mean :4.173 Mean : 216593 Mean : 20.1512 Mean :7.778e+06 NA Mean : 1.099 NA NA NA NA NA
Q3 NA NA 3rd Qu.:4.500 3rd Qu.: 29401 3rd Qu.: 27.0000 3rd Qu.:1.000e+06 NA 3rd Qu.: 0.000 NA NA NA NA NA
Max NA NA Max. :5.000 Max. :78158306 Max. :100.0000 Max. :1.000e+09 NA Max. :400.000 NA NA NA NA NA

There are 1463 missing values in rating.

As can be observed, apps in the Family category have the most NA values in Rating. Rather than dropping these rows, we handle them by replacing the missing ratings with the overall mean rating.
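
The counts behind this observation could be obtained along the following lines. This is a minimal sketch on a made-up data frame (`toy`), not the code that produced the figures above.

```r
# Made-up data with missing ratings spread over several categories
toy <- data.frame(
  Category = c("FAMILY", "FAMILY", "GAME", "TOOLS", "FAMILY"),
  Rating   = c(NA, 4.1, NA, 3.9, NA)
)

# Total number of missing ratings
na_total <- sum(is.na(toy$Rating))

# Missing ratings per category, largest first
na_by_category <- sort(tapply(is.na(toy$Rating), toy$Category, sum),
                       decreasing = TRUE)

print(na_total)
print(na_by_category)
```
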

Checking for outliers in Rating by inspecting the frequency of each rating value

 breaks = seq(15,20,by = 1)
frequency_table = table(data_clean$Rating)
frequency_table
## 
##                1              1.2              1.4              1.5 
##               16                1                3                3 
##              1.6              1.7              1.8              1.9 
##                4                8                8               11 
##                2              2.1              2.2              2.3 
##               12                8               14               20 
##              2.4              2.5              2.6              2.7 
##               19               20               24               23 
##              2.8              2.9                3              3.1 
##               40               45               81               69 
##              3.2              3.3              3.4              3.5 
##               63              100              126              156 
##              3.6              3.7              3.8              3.9 
##              167              224              286              359 
##                4              4.1 4.17324304538799              4.2 
##              513              621             1463              810 
##              4.3              4.4              4.5              4.6 
##              897              895              848              683 
##              4.7              4.8              4.9                5 
##              442              221               85              271

From the table above, all ratings lie between 1 and 5, and most of them are above 4.

Replacing NA values in Rating with mean

#Replace NA in Ratings with Overall Mean
data_clean <- data_clean %>%
  mutate(Rating = ifelse(is.na(Rating), mean(Rating, na.rm = TRUE), Rating))
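
The chunk above fills every missing rating with the overall mean. If a per-category mean were preferred instead, one possible variant is sketched below in base R with `ave()`. The data frame `toy` is made up for illustration, and this is an alternative under that assumption, not the approach used in this report.

```r
# Made-up data: one missing rating in each of two categories
toy <- data.frame(
  Category = c("FAMILY", "FAMILY", "FAMILY", "GAME", "GAME"),
  Rating   = c(4.0, NA, 4.4, 3.0, NA)
)

# ave() computes the mean within each Category and returns a vector
# aligned with the rows of the data frame
category_mean <- ave(toy$Rating, toy$Category,
                     FUN = function(r) mean(r, na.rm = TRUE))

# Only the missing entries are replaced; observed ratings are kept
toy$Rating <- ifelse(is.na(toy$Rating), category_mean, toy$Rating)

print(toy)
```

A group-wise mean keeps imputed values closer to what is typical for each category, at the cost of being less stable for categories with very few rated apps.
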

xkablesummary(data_clean)
Table: Statistics summary.
App Category Rating Reviews Size Installs Type Price Content.Rating Genres Last.Updated Current.Ver Android.Ver
Min Length:9659 Length:9659 Min. :1.000 Min. : 0 Min. : 0.0085 Min. :0.000e+00 Length:9659 Min. : 0.000 Length:9659 Length:9659 Length:9659 Length:9659 Length:9659
Q1 Class :character Class :character 1st Qu.:4.000 1st Qu.: 25 1st Qu.: 5.3000 1st Qu.:1.000e+03 Class :character 1st Qu.: 0.000 Class :character Class :character Class :character Class :character Class :character
Median Mode :character Mode :character Median :4.200 Median : 967 Median : 13.1000 Median :1.000e+05 Mode :character Median : 0.000 Mode :character Mode :character Mode :character Mode :character Mode :character
Mean NA NA Mean :4.173 Mean : 216593 Mean : 20.1512 Mean :7.778e+06 NA Mean : 1.099 NA NA NA NA NA
Q3 NA NA 3rd Qu.:4.500 3rd Qu.: 29401 3rd Qu.: 27.0000 3rd Qu.:1.000e+06 NA 3rd Qu.: 0.000 NA NA NA NA NA
Max NA NA Max. :5.000 Max. :78158306 Max. :100.0000 Max. :1.000e+09 NA Max. :400.000 NA NA NA NA NA

Now there are no missing values in Rating.

Category

# Checking the type of the Category 
typeof(data_apps$Category)
## [1] "character"
length(unique(data_clean$Category))
## [1] 33
length(unique(data_clean$Genres))
## [1] 118

There are 33 categories and 118 genres in the data frame, which means that a category can contain multiple genres. Because Category is the coarser and more consistent grouping, the later analyses in this project use the Category variable.
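
To see how genres nest inside categories, the number of distinct genres per category can be counted. A minimal sketch on made-up rows (the data frame `toy` is hypothetical):

```r
# Made-up rows in the same format as the Category and Genres columns
toy <- data.frame(
  Category = c("ART_AND_DESIGN", "ART_AND_DESIGN", "GAME", "GAME", "GAME"),
  Genres   = c("Art & Design", "Art & Design;Pretend Play",
               "Action", "Arcade", "Action")
)

# Number of distinct genres within each category
genres_per_category <- tapply(toy$Genres, toy$Category,
                              function(g) length(unique(g)))

print(genres_per_category)
```
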

Below is the graph for the distribution of Categories for the dataset after removing duplicates.

Current Version & Genres

Due to the inconsistent formatting of values in the Current.Ver column, this column is dropped and excluded from the analysis. The Genres column is dropped as well, since the analysis proceeds with Category.

data_final <- data_clean %>% select(-c('Genres', 'Current.Ver'))
data_final$Category <- as.factor(data_final$Category)
data_final$Android.Ver <- as.factor(data_final$Android.Ver)

Content Rating, Last Updated

# Remove leading and trailing spaces and convert all text to a consistent format 
data_final$Content.Rating <- trimws(tolower(data_final$Content.Rating))

cr_missing <- sum(is.na(data_final$Content.Rating))

print(paste("Number of missing values in 'Content Rating':", cr_missing))
## [1] "Number of missing values in 'Content Rating': 0"

There are no missing values for Content rating.

# Convert Last Updated to Date format
data_final$Last.Updated <- as.Date(data_final$Last.Updated, format = "%B %d, %Y")

# Verify the cleaning
print("\nSummary of Last.Updated after cleaning:")
## [1] "\nSummary of Last.Updated after cleaning:"
print(summary(data_clean$Last.Updated))
##    Length     Class      Mode 
##      9659 character character
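
The summary above was computed on data_clean, whose Last.Updated column is still a character vector, which is why it reports only length, class and mode. The converted column lives in data_final. A minimal sketch of how a conversion like this can be verified, using a few made-up date strings rather than the real column:

```r
# Month names parsed by %B depend on the locale; fix it to "C" (English)
# so that "January", "August", ... are recognised
invisible(Sys.setlocale("LC_TIME", "C"))

# Made-up values in the same format as Last.Updated, plus one bad entry
last_updated <- c("January 7, 2018", "August 1, 2018", "not a date")

parsed <- as.Date(last_updated, format = "%B %d, %Y")

class(parsed)                 # should be "Date" after conversion
sum(is.na(parsed))            # entries that failed to parse
range(parsed, na.rm = TRUE)   # earliest and latest date
```

Counting the NA values after conversion is a quick way to spot entries that did not match the expected format.
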

After cleaning the Data

str(data_final)
## 'data.frame':    9659 obs. of  11 variables:
##  $ App           : chr  "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "U Launcher Lite – FREE Live Cool Themes, Hide Apps" "Sketch - Draw & Paint" ...
##  $ Category      : Factor w/ 33 levels "ART_AND_DESIGN",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Rating        : num  4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
##  $ Reviews       : num  159 967 87510 215644 967 ...
##  $ Size          : num  19 14 8.7 25 2.8 5.6 19 29 33 3.1 ...
##  $ Installs      : num  1e+04 5e+05 5e+06 5e+07 1e+05 5e+04 5e+04 1e+06 1e+06 1e+04 ...
##  $ Type          : chr  "Free" "Free" "Free" "Free" ...
##  $ Price         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Content.Rating: chr  "everyone" "everyone" "everyone" "teen" ...
##  $ Last.Updated  : Date, format: "2018-01-07" "2018-01-15" ...
##  $ Android.Ver   : Factor w/ 34 levels "1.0 and up","1.5 and up",..: 16 16 16 19 21 9 16 19 11 16 ...

Data Exploration and Visualization

Visualization for Price Distribution

# Count Plot for the Price distribution
ggplot(data_final, aes(x=Price)) +
  geom_histogram(binwidth=2, fill="pink", color="black") +
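  # Zoom with coord_cartesian(): xlim()/ylim() would drop bars outside the limits
  coord_cartesian(xlim = c(0, 500), ylim = c(0, 500)) +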
  labs(title="Price Distribution", x="Price", y="Frequency") +
  theme_minimal()

The distribution is highly right-skewed because the vast majority of apps have a price of 0.

# Boxplot for the same
ggplot(data_final, aes(y=Price)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 16, outlier.size = 1, fill="pink", color="black") +
  labs(title="Price Boxplot", y="Price") +
  theme_minimal()

Checking outliers for Price

outlierKD2 <- function(df, var, rm = FALSE, boxplt = FALSE, histogram = TRUE, qqplt = FALSE) {
  dt <- df  # Duplicate the dataframe for potential alteration
  var_name <- eval(substitute(var), eval(dt))
  na1 <- sum(is.na(var_name))
  m1 <- mean(var_name, na.rm = TRUE)
  colTotal <- boxplt + histogram + qqplt  # Calculate the total number of charts to be displayed
  par(mfrow = c(2, max(2, colTotal)), oma = c(0, 0, 3, 0))  # Adjust layout for plots
  
  # Q-Q plot of the original data (outliers included)
  if (qqplt) {
    qqnorm(var_name, main = "Q-Q plot with Outliers")
    qqline(var_name)
  }
  
  # Histogram of the original data (outliers included)
  if (histogram) { 
    hist(var_name, main = "Histogram with Outliers", xlab = NA, ylab = NA) 
  }
  
  # Box plot of the original data (outliers included)
  if (boxplt) { 
    boxplot(var_name, main = "Box Plot with Outliers")
  }
  
  # Identify outliers with the boxplot (1.5 * IQR) rule and set them to NA
  outlier <- boxplot.stats(var_name)$out
  mo <- mean(outlier)
  var_name <- ifelse(var_name %in% outlier, NA, var_name)
  
  # Q-Q plot without outliers
  if (qqplt) {
    qqnorm(var_name, main = "Q-Q plot without Outliers")
    qqline(var_name)
  }
  
  # Histogram without outliers
  if (histogram) { 
    hist(var_name, main = "Histogram without Outliers", xlab = NA, ylab = NA) 
  }
  
  # Box plot without outliers
  if (boxplt) { 
    boxplot(var_name, main = "Box Plot without Outliers") 
  }
  
  # Add the title for the overall plot section if any plots are displayed
  if (colTotal > 0) {
    title("Outlier Check", outer = TRUE)
    na2 <- sum(is.na(var_name))
    cat("Outliers identified:", na2 - na1, "\n")
    cat("Proportion (%) of outliers:", round((na2 - na1) / sum(!is.na(var_name)) * 100, 1), "\n")
    cat("Mean of the outliers:", round(mo, 2), "\n")
    cat("Mean without removing outliers:", round(m1, 2), "\n")
    cat("Mean if we remove outliers:", round(mean(var_name, na.rm = TRUE), 2), "\n")
  }
}
# The outlier function is defined in the previous chunk of code.
outlier_check_price = outlierKD2(data_final, Price, rm = FALSE, boxplt = TRUE, qqplt = TRUE)

## Outliers identified: 756 
## Proportion (%) of outliers: 8.5 
## Mean of the outliers: 14.05 
## Mean without removing outliers: 1.1 
## Mean if we remove outliers: 0

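Because the first and third quartiles of Price are both 0, the boxplot rule flags every non-zero price as an outlier: the 756 outliers are exactly the paid apps (9659 apps in total minus 8903 free apps). These prices, including the extreme ones, are valid observations rather than data errors, so we keep them in the analysis.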
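
The behaviour of the boxplot rule on a variable dominated by zeros can be reproduced on a small made-up vector. The sketch below uses `boxplot.stats()`, the same function called inside outlierKD2, and the values in `price` are invented for illustration.

```r
# Made-up prices: 17 free apps and 3 paid apps
price <- c(rep(0, 17), 0.99, 4.99, 399.99)

stats <- boxplot.stats(price)

# Whisker ends, hinges and median: all 0 because most values are 0
print(stats$stats)

# With an interquartile range of 0, every non-zero price is flagged
print(stats$out)
```

When more than three quarters of the values are identical, the 1.5 * IQR fences collapse onto that value, so the rule separates "zero" from "non-zero" rather than typical from extreme prices.
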

#To check the value ranges
table(data_final$Price)
## 
##      0   0.99      1   1.04    1.2   1.26   1.29   1.49    1.5   1.59   1.61 
##   8903    145      3      1      1      1      1     46      1      1      1 
##    1.7   1.75   1.76   1.96   1.97   1.99      2   2.49    2.5   2.56   2.59 
##      2      1      1      1      1     73      3     25      1      1      1 
##    2.6    2.9   2.95   2.99   3.02   3.04   3.08   3.28   3.49   3.61   3.88 
##      1      1      1    124      1      1      1      1      7      1      1 
##    3.9   3.95   3.99   4.29   4.49   4.59    4.6   4.77    4.8   4.84   4.85 
##      1      1     57      1      9      1      1      1      1      1      1 
##   4.99      5   5.49   5.99   6.49   6.99   7.49   7.99   8.49   8.99      9 
##     70      1      5     26      5     11      2      7      2      5      1 
##   9.99     10  10.99  11.99  12.99  13.99     14  14.99  15.46  15.99  16.99 
##     19      2      2      3      4      2      1      9      1      1      2 
##  17.99  18.99   19.4   19.9  19.99  24.99  25.99  28.99  29.99  30.99  33.99 
##      2      1      1      1      5      3      1      1      5      1      1 
##  37.99  39.99  46.99  74.99  79.99  89.99 109.99 154.99    200 299.99 379.99 
##      1      2      1      1      1      1      1      1      1      1      1 
## 389.99 394.99 399.99    400 
##      1      1     12      1

As already mentioned, most apps are free: 8903 apps have a price of 0.
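
As a quick worked calculation from the counts reported above (8903 free apps out of 9659 rows), the share of free and paid apps can be derived as follows:

```r
free_apps  <- 8903   # apps with Price == 0, from the table above
total_apps <- 9659   # rows in data_final

paid_apps  <- total_apps - free_apps
share_free <- round(free_apps / total_apps * 100, 1)
share_paid <- round(paid_apps / total_apps * 100, 1)

cat("Paid apps:", paid_apps, "\n")
cat("Free share (%):", share_free, "\n")
cat("Paid share (%):", share_paid, "\n")
```
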

Visualization for Type Distribution

# Bar Plot for the Type Distribution
ggplot(data_final, aes(x = Type)) +
  geom_bar(fill = "pink", color = "black") +
  labs(title = "Distribution of App Types (Free vs Paid)", x = "Type", y = "Count") +
  theme_minimal()

Clearly, free apps far outnumber paid apps.

#Display statistics for the Price of apps grouped by their Type
data_final$Type <- as.factor(data_final$Type)


summary_by_type <- data.frame(
  Type = levels(data_final$Type),
  Min_Price = tapply(data_clean$Price, data_clean$Type, min, na.rm = TRUE),
  Max_Price = tapply(data_clean$Price, data_clean$Type, max, na.rm = TRUE),
  Mean_Price = tapply(data_clean$Price, data_clean$Type, mean, na.rm = TRUE),
  Median_Price = tapply(data_clean$Price, data_clean$Type, median, na.rm = TRUE)
)


print(summary_by_type)
##      Type Min_Price Max_Price Mean_Price Median_Price
## Free Free      0.00         0    0.00000         0.00
## NaN   NaN      0.00         0    0.00000         0.00
## Paid Paid      0.99       400   14.04515         2.99
# Box plot for price distribution by app type
ggplot(data_final, aes(x = Type, y = Price, fill = Type)) +
  geom_boxplot() +
  labs(title = "Price Distribution by App Type", 
       x = "App Type", 
       y = "Price ($)") +
  theme_minimal()

Histogram for price distribution by App Type

ggplot(data_final, aes(x = Price, fill = Type)) +
  geom_histogram(binwidth = 60, alpha = 0.7, position = "identity") +
  facet_wrap(~ Type) +
  labs(title = "Price Distribution by App Type", 
       x = "Price ($)", 
       y = "Count") +
  theme_minimal()

Comparing the price distribution across app types shows that the Type column does not always agree with Price: besides Free and Paid, it contains a NaN level whose prices are all 0. Since Price alone tells us whether an app is free or paid, the Type column is redundant.
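
One way to make this comparison explicit is to derive the type from Price and list the rows where it disagrees with the recorded Type. A minimal sketch on a made-up data frame (`toy` and `derived_type` are hypothetical names):

```r
# Made-up rows, including one with an invalid Type label
toy <- data.frame(
  App   = c("A", "B", "C", "D"),
  Type  = c("Free", "Paid", "NaN", "Free"),
  Price = c(0, 2.99, 0, 0)
)

# Type implied by the price alone
derived_type <- ifelse(toy$Price == 0, "Free", "Paid")

# Rows where the recorded Type disagrees with the price
mismatch <- toy[toy$Type != derived_type, ]

print(mismatch)
```
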

Hence, we remove the Type column.

Dropping the Type column

#Using subset function
data_final <- subset(data_final, select = -Type)
#After removing the Type column and duplicated values
str(data_final)
## 'data.frame':    9659 obs. of  10 variables:
##  $ App           : chr  "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "U Launcher Lite – FREE Live Cool Themes, Hide Apps" "Sketch - Draw & Paint" ...
##  $ Category      : Factor w/ 33 levels "ART_AND_DESIGN",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Rating        : num  4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
##  $ Reviews       : num  159 967 87510 215644 967 ...
##  $ Size          : num  19 14 8.7 25 2.8 5.6 19 29 33 3.1 ...
##  $ Installs      : num  1e+04 5e+05 5e+06 5e+07 1e+05 5e+04 5e+04 1e+06 1e+06 1e+04 ...
##  $ Price         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Content.Rating: chr  "everyone" "everyone" "everyone" "teen" ...
##  $ Last.Updated  : Date, format: "2018-01-07" "2018-01-15" ...
##  $ Android.Ver   : Factor w/ 34 levels "1.0 and up","1.5 and up",..: 16 16 16 19 21 9 16 19 11 16 ...
The Type column is successfully removed.

Let’s do bivariate analysis on price and other variables starting from here.

Visualization for Price vs Installs

#Scatter plot of Price against log-transformed Installs
ggplot(data_final, aes(x = Price, y = log(Installs))) +
  geom_point(color = 'red', size = 1, alpha = 0.5) + 
  geom_smooth(method = 'lm', color = 'blue', se = FALSE) +
  labs(title = "Price vs Installs", x = "Price (USD)", y = "Log(Installs)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

From the scatter plot, we can see that installs are concentrated at a price of 0.

# Categorize the apps as "Free" or "Paid" based on Price
Price_Category <- ifelse(data_final$Price == 0, "Free", "Paid")
str(data_final$Price)
##  num [1:9659] 0 0 0 0 0 0 0 0 0 0 ...
str(Price_Category)
##  chr [1:9659] "Free" "Free" "Free" "Free" "Free" "Free" "Free" "Free" ...
#str(log(data_clean$Installs))

For a better visualization, we categorize apps with a price of 0 as free and all others as paid, and draw a box plot.

# Box plot of Price Category vs. log-transformed Installs
ggplot(data_final, aes(x = Price_Category, y = log(Installs))) +
  geom_boxplot(fill = "lightblue", color = "darkblue", alpha = 0.6) +
  labs(title = "Price Categories vs. Log-Transformed Installs", 
       x = "Price Category", 
       y = "Log(Installs)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  

“Free” apps tend to have more installs than “Paid” apps. The difference between the means on the log scale is estimated to be between 3.47 and 3.97.
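
The text above does not show how this interval was computed. One common way to obtain such an interval is a Welch two-sample t-test on the log-transformed installs. The sketch below runs one on simulated data, so its numbers are not the ones reported above; free_installs and paid_installs are hypothetical vectors.

```r
set.seed(42)  # reproducible toy data

# Simulated install counts (hypothetical, not the Play Store data)
free_installs <- round(exp(rnorm(200, mean = 10, sd = 2)))
paid_installs <- round(exp(rnorm(50, mean = 6, sd = 2)))

# log1p() keeps apps with 0 installs finite, unlike log()
log_free <- log1p(free_installs)
log_paid <- log1p(paid_installs)

# Welch two-sample t-test; conf.int holds the interval for the
# difference between the two group means on the log scale
tt <- t.test(log_free, log_paid)
print(tt$conf.int)
```

An interval that lies entirely above 0 indicates that the first group has the larger mean on the log scale.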

table(Price_Category)
## Price_Category
## Free Paid 
## 8903  756
# Add Price_Category to a copy of data_final
data_duplicate <- data_final
data_duplicate$Price_Category <- ifelse(data_final$Price == 0, "Free", "Paid")

# Create a summarized table for Price_Category and log_Installs
summary_table <- data_duplicate %>%
  group_by(Price_Category) %>%
  summarise(Average_Log_Installs = mean(log(data_clean$Installs), na.rm = TRUE),
            Count = n())

# View the summarized table
kable(summary_table, format = "html", col.names = c("Price Category", "Mean Log(Installs)", "App Count")) %>%
  kable_styling(full_width = FALSE, position = "center") 
Price Category   Mean Log(Installs)   App Count
Free             -Inf                 8903
Paid             -Inf                 756
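
The -Inf values in this table are what mean(log(x)) returns whenever x contains a 0, since log(0) is -Inf. In addition, referring to data_clean$Installs inside summarise() uses the whole column rather than the rows of each group, which is why both groups show the same value. A minimal sketch of a per-group summary that avoids both issues is given below; it uses base R and a hypothetical toy data frame, and log1p() is our substitution for log(), not the transformation used elsewhere in this report.

```r
# Toy data (illustrative values only); note the app with 0 installs
toy_apps <- data.frame(
  Price    = c(0, 0, 0, 1.99, 4.99),
  Installs = c(0, 1000, 100000, 500, 10)
)
toy_apps$Price_Category <- ifelse(toy_apps$Price == 0, "Free", "Paid")

# log(0) is -Inf, so a plain mean of log(Installs) is -Inf
print(mean(log(toy_apps$Installs)))

# log1p(x) = log(1 + x) stays finite at 0; aggregate() computes per group
summary_table <- aggregate(
  log1p(Installs) ~ Price_Category,
  data = toy_apps,
  FUN = mean
)
names(summary_table)[2] <- "Mean_Log1p_Installs"

# Number of apps in each group, matched by group name
group_sizes <- table(toy_apps$Price_Category)
summary_table$Count <- as.vector(
  group_sizes[as.character(summary_table$Price_Category)]
)
print(summary_table)
```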

Visualization for Price vs Reviews & Rating

# Plot Price vs. Reviews
ggplot(data_final, aes(x=Price, y=Reviews)) +
  geom_point(color = 'blue') +
  geom_smooth(method = 'lm', color = 'red', se = FALSE) +
  labs(title = "Price vs Reviews", x = "Price (USD)", y = "Number of Reviews") +
  theme_minimal() + 
  theme(
    panel.background = element_rect(fill = "white"),  # Set panel background to white
    plot.background = element_rect(fill = "white"),   # Set plot background to white
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

# Plot Price vs. Rating
ggplot(data_final, aes(x=Price, y=Rating)) +
  geom_point(color = 'green') +
  geom_smooth(method = 'lm', color = 'red', se = FALSE) +
  labs(title = "Price vs Rating", x = "Price (USD)", y = "Rating") +
  theme_minimal() + 
  theme(
    panel.background = element_rect(fill = "white"),  # Set panel background to white
    plot.background = element_rect(fill = "white"),   # Set plot background to white
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

Price vs Reviews: Cheaper apps tend to have more reviews, indicating higher popularity or more frequent downloads. In contrast, expensive apps tend to have fewer reviews, possibly because fewer people buy higher-priced apps.

Price vs Rating: Price does not strongly affect the average rating, but there is a slight trend where lower-priced apps show more variation in ratings, while higher-priced apps tend to receive more consistent ratings around 4. Higher-priced apps may be meeting customer expectations.

Visualization for Price vs Reviews vs Installs

# Scatter plot of Price vs. Rating with log(Installs) as color
ggplot(data_final, aes(x = Price, y = Rating, color = log(Installs))) +
  geom_point(alpha = 0.6) +
  scale_color_gradient(low = "blue", high = "red") +  
  labs(title = "Price vs. Rating with log(Installs) as Color", 
       x = "Price", 
       y = "Rating", 
       color = "log(Installs)") +
  theme_minimal()

# Scatter plot of Price vs. Reviews with log(Installs) as color
ggplot(data_final, aes(x = Price, y = Reviews, color = log(Installs))) +
  geom_point(alpha = 0.6) +
  scale_color_gradient(low = "darkgreen", high = "yellow") +  
  labs(title = "Price vs. Reviews with log(Installs) as Color", 
       x = "Price", 
       y = "Reviews", 
       color = "log(Installs)") +
  theme_minimal()

In conclusion, apps with lower prices have more ratings and installs, while higher-priced apps tend to have fewer installs and more scattered ratings. The same pattern holds for reviews.

Visualization for Price vs Size

# Plot Price vs Size
ggplot(data_final, aes(x=Price, y=Size)) +
  geom_point(color = 'red') + 
  geom_smooth(method = 'lm', color = 'blue', se = FALSE) +
  labs(title = "Price vs Size", x = "Price (USD)", y = "App Size (MB)") +
  theme_minimal() 

Visualization for Distribution of Installs

# Bar plot for distribution of Installs
# Create a new data frame to store the factor levels
data_clean1_factor <- data_final  

data_clean1_factor$Installs <- factor(data_final$Installs, levels = c(0,1,5,10,50,100,500,1000,5000,10000,50000,100000,500000,1000000,5000000,10000000,50000000,100000000,500000000,1000000000))

# Create a bar plot with the ordered factor
ggplot(data_clean1_factor, aes(x = Installs)) +
  geom_bar() +
  xlab("Installs") +
  ylab("Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +  
  ggtitle("Distribution of App Installs")

Visualization for Rating Distribution

boxplot(data_final$Rating,ylab = "Rating", xlab = "Count",col = "Blue")

hist(data_clean$Rating, main="Histogram of Apps Rating after cleaning", xlab="Rating (count)", col = 'blue', breaks = 100 )

qqnorm(data_clean$Rating)
qqline(data_clean$Rating, col = "red")

Here, the plots are much clearer but still skewed because of outliers in the 1-3 rating range. These low ratings may help explain why some apps are poorly rated, so they cannot be removed from our dataset.

Visualization for Reviews

boxplot(data_final$Reviews,ylab = "Reviews", xlab = "Count",col = 'Blue')

hist(data_final$Reviews, main="Histogram of Apps Reviews", xlab="Reviews (count)", col = 'blue', breaks = 100 )

ggplot(data_final, aes(x = log(Reviews))) +
  geom_histogram(binwidth = 0.1, fill = "blue", color = "black") +
  labs(title = "Log-Transformed Histogram of Reviews", x = "Log(Reviews)", y = "Count")

qqnorm(data_final$Reviews)
qqline(data_final$Reviews, col = "red")

As with ratings, the plots are skewed by outliers, so the raw review counts do not follow a normal distribution. For visualization we therefore use the log of Reviews, which compresses the long right tail.
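
The effect of the log transform on skew can be checked numerically. The sketch below is an illustration on simulated review counts, not on our dataset. The skewness() helper is a hypothetical function written here (the third standardized moment), and log1p() is used instead of log() so that apps with 0 reviews stay finite.

```r
set.seed(1)  # reproducible toy data

# Simulated heavy-tailed review counts (hypothetical)
reviews <- round(rlnorm(1000, meanlog = 6, sdlog = 2.5))

# Sample skewness: third standardized moment
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

skew_raw <- skewness(reviews)
skew_log <- skewness(log1p(reviews))
print(c(raw = skew_raw, log1p = skew_log))
```

A skewness close to 0 after the transform suggests the log scale is the more readable one for histograms.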

Review frequency table

xkablesummary(data_final)
Table: Statistics summary.
App Category Rating Reviews Size Installs Price Content.Rating Last.Updated Android.Ver
Min Length:9659 FAMILY :1832 Min. :1.000 Min. : 0 Min. : 0.0085 Min. :0.000e+00 Min. : 0.000 Length:9659 Min. :2010-05-21 4.1 and up :2202
Q1 Class :character GAME : 959 1st Qu.:4.000 1st Qu.: 25 1st Qu.: 5.3000 1st Qu.:1.000e+03 1st Qu.: 0.000 Class :character 1st Qu.:2017-08-05 4.0.3 and up :1395
Median Mode :character TOOLS : 827 Median :4.200 Median : 967 Median : 13.1000 Median :1.000e+05 Median : 0.000 Mode :character Median :2018-05-04 4.0 and up :1285
Mean NA BUSINESS : 420 Mean :4.173 Mean : 216593 Mean : 20.1512 Mean :7.778e+06 Mean : 1.099 NA Mean :2017-10-30 Varies with device: 990
Q3 NA MEDICAL : 395 3rd Qu.:4.500 3rd Qu.: 29401 3rd Qu.: 27.0000 3rd Qu.:1.000e+06 3rd Qu.: 0.000 NA 3rd Qu.:2018-07-17 4.4 and up : 818
Max NA PERSONALIZATION: 376 Max. :5.000 Max. :78158306 Max. :100.0000 Max. :1.000e+09 Max. :400.000 NA Max. :2018-08-08 2.3 and up : 616
NA NA (Other) :4850 NA NA NA NA NA NA NA (Other) :2353
outlierKD2(data_final, Reviews)
## Outliers identified: 1656 
## Proportion (%) of outliers: 20.7 
## Mean of the outliers: 1228141 
## Mean without removing outliers: 216592.6 
## Mean if we remove outliers: 7280.61

To see where these outliers lie, let's split the data into bins and check which bins hold the most apps; this shows how reviews are distributed.
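
For reference, outlier counts like the ones printed above can be approximated with the standard boxplot rule, which flags points more than 1.5 times the interquartile range beyond the hinges. We assume, without having checked its source, that outlierKD2 applies a rule of this kind. The sketch below uses base R's boxplot.stats() on a made-up vector.

```r
# Toy review counts with a few very large values (illustrative only)
reviews <- c(3, 10, 25, 40, 80, 120, 300, 950, 50000, 2000000)

# boxplot.stats() flags points beyond 1.5 * IQR from the hinges
out_values <- boxplot.stats(reviews)$out
is_outlier <- reviews %in% out_values

cat("Outliers identified:", sum(is_outlier), "\n")
cat("Proportion (%) of outliers:", round(100 * mean(is_outlier), 1), "\n")
cat("Mean without removing outliers:", mean(reviews), "\n")
cat("Mean if we remove outliers:", mean(reviews[!is_outlier]), "\n")
```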

Binned reviews

Binning reviews into custom ranges to check the average rating for each bin

# Define custom breaks that spread the review counts over more even intervals
breaks <- c(0, 100, 500, 1000, 2500, 5000, 10000, 25000,50000,100000, 300000,1000000,Inf)

# Create a categorical variable based on the new breaks
Review_Category <- cut(data_final$Reviews, breaks = breaks, right = FALSE, 
                   labels = c("0+","100+", "500+", "1K+",
                              "2.5K+", "5K+", "10K+","25K+",
                              "50K+", "100K+","300K+","1M+"))

# Count the number of values in each bin
bin_counts <- as.data.frame(table(Review_Category))

# Rename the columns for clarity
colnames(bin_counts) <- c("Review_Category", "Count")

# Print the counts
print(bin_counts)
##    Review_Category Count
## 1               0+  3327
## 2             100+  1065
## 3             500+   462
## 4              1K+   586
## 5            2.5K+   475
## 6              5K+   474
## 7             10K+   719
## 8             25K+   606
## 9             50K+   498
## 10           100K+   647
## 11           300K+   451
## 12             1M+   349
# Create a line plot of the binned counts
ggplot(bin_counts, aes(x = Review_Category, y = Count, group = 1)) +
  geom_line(color = "blue", size = 1) +
  geom_point(color = "blue", size = 3) +
  labs(title = "Number of Apps by Review Category", 
       x = "Review Category", 
       y = "Number of Apps") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for readability

Hence, few apps have a high number of reviews while many apps have few reviews, which is expected.
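
The binning above relies on cut() with right = FALSE, which makes every bin closed on the left and open on the right. The sketch below shows, on a few hypothetical values and a shortened set of breaks, where boundary values such as 100 and 500 end up.

```r
# A few hypothetical review counts, including values on bin boundaries
reviews <- c(0, 99, 100, 499, 500, 1000000)

breaks <- c(0, 100, 500, Inf)
labels <- c("0+", "100+", "500+")

# right = FALSE gives the intervals [0,100), [100,500), [500,Inf)
bins <- cut(reviews, breaks = breaks, right = FALSE, labels = labels)
print(data.frame(reviews, bins))
print(table(bins))
```

With the default right = TRUE, a value of exactly 0 would fall outside the first bin and become NA, which is why the left-closed form suits count data that starts at 0.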

Boxplots for Rating vs Reviews

boxplot(data_final$Rating ~ Review_Category, 
        main = "Boxplot of Ratings by Review Category", 
        xlab = "Review Category", 
        ylab = "Rating",
        las = 2,        # Rotate the x-axis labels for readability
        col = "lightblue")  # Optional: Set color for the boxplots

Here we can observe that as the number of reviews increases, the median rating rises and the values cluster around higher ratings, which suggests that apps with more reviews tend to be better rated.

Mean Rating for Each Review Bin

# Calculate the mean Rating for each Review_Category
mean_ratings <- tapply(data_final$Rating, Review_Category, mean, na.rm = TRUE)

# Convert the result to a data frame for better readability
mean_ratings_df <- data.frame(Review_Category = names(mean_ratings), Mean_Rating = as.numeric(mean_ratings))

# Print the mean ratings for each review bin
print(mean_ratings_df)
##    Review_Category Mean_Rating
## 1               0+    4.126221
## 2             100+    4.029538
## 3             500+    4.063188
## 4              1K+    4.107030
## 5            2.5K+    4.129572
## 6              5K+    4.191139
## 7             10K+    4.221836
## 8             25K+    4.231848
## 9             50K+    4.293775
## 10           100K+    4.329830
## 11           300K+    4.375610
## 12             1M+    4.426361
# Define correct order of Review_Category as a factor
mean_ratings_df$Review_Category <- factor(mean_ratings_df$Review_Category, 
                                          levels = c("0+","100+", "500+", "1K+",
                                                     "2.5K+", "5K+", "10K+","25K+",
                                                     "50K+", "100K+", "300K+", "1M+"))

# Plot the mean ratings for each review bin in the correct order
ggplot(mean_ratings_df, aes(x = Review_Category, y = Mean_Rating)) +
  geom_bar(stat = "identity", fill = "steelblue") +  # Use bar plot
  labs(title = "Mean Rating by Review Category",
       x = "Review Category",
       y = "Mean Rating") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for readability

As we can see, the mean rating increases as the reviews increase.
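
A rank correlation is one way to put a number on this kind of trend without depending on the chosen bins. The sketch below computes Spearman's rho on simulated data in which ratings are constructed to drift upward with the log of the review count. The vectors reviews and rating are hypothetical, and the result says nothing about the strength of the relationship in our dataset.

```r
set.seed(7)  # reproducible toy data

# Simulated data (hypothetical): ratings drift upward with log review count
reviews <- round(rlnorm(500, meanlog = 7, sdlog = 2))
rating  <- 3.8 + 0.05 * log1p(reviews) + rnorm(500, sd = 0.4)
rating  <- pmin(5, pmax(1, rating))  # keep ratings inside the 1-5 range

# Spearman's rank correlation is insensitive to the heavy right tail
rho <- cor(reviews, rating, method = "spearman")
print(rho)
```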

Histogram for Reviews and Rating

# Create a new data frame for plotting
plot_data <- data.frame(Rating = data_final$Rating, Review_Category = Review_Category)

# Create a histogram of Ratings, faceted by Review_Category
ggplot(plot_data, aes(x = Rating)) +
  geom_histogram(bins = 30, fill = "blue", alpha = 0.7) +
  facet_wrap(~ Review_Category, labeller = label_wrap_gen()) +  # Facet by Review_Category
  theme_minimal() +
  labs(title = "Histograms of Ratings by Review Category", x = "Rating", y = "Frequency")

This is another representation of ratings vs reviews.

Visualization for Reviews vs Installs

# Scatter plot for Installs vs Reviews
ggplot(data_clean1_factor, aes(x = Review_Category, y = Installs)) +
  geom_point(color = "blue", alpha = 0.5) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +  # Add a regression line
  labs(title = "Scatter Plot of Installs vs Reviews", 
       x = "Number of Reviews", 
       y = "Number of Installs") +
  theme_minimal()

We can observe that the more installs an app has, the more reviews it has.

Visualization of Mean Rating for Different Install Categories

# Calculate the mean Rating for each Installs level
mean_ratings <- tapply(data_final$Rating, data_clean1_factor$Installs, mean, na.rm = TRUE)

# Convert the result to a data frame for better readability
mean_ratings_df <- data.frame(Installs = names(mean_ratings), Mean_Rating = as.numeric(mean_ratings))

# Print the mean ratings for each Installs level
print(mean_ratings_df)
##    Installs Mean_Rating
## 1         0    4.173243
## 2         1    4.210262
## 3         5    4.221302
## 4        10    4.254142
## 5        50    4.240882
## 6       100    4.254521
## 7       500    4.176062
## 8      1000    4.086812
## 9      5000    4.035362
## 10    10000    4.041438
## 11    50000    4.048356
## 12    1e+05    4.117373
## 13    5e+05    4.168462
## 14    1e+06    4.216335
## 15    5e+06    4.227677
## 16    1e+07    4.299146
## 17    5e+07    4.333663
## 18    1e+08    4.386702
## 19    5e+08    4.375000
## 20    1e+09    4.215000
mean_ratings_df$Installs = factor(mean_ratings_df$Installs, levels = c(0,1,5,10,50,100,500,1000,5000,10000,50000,100000,500000,1000000,5000000,10000000,50000000,100000000,500000000,1000000000))

# Plot the mean ratings for each Installs level in the correct order
ggplot(mean_ratings_df, aes(x = Installs, y = Mean_Rating)) +
  geom_bar(stat = "identity", fill = "steelblue") +  # Use bar plot
  labs(title = "Mean Rating by Install Category",
       x = "Installs Category",
       y = "Mean Rating") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for readability

Looking at how Rating fluctuates across install levels, there is no steady increasing or decreasing trend between Installs and Rating. This is to be expected, since a higher rating does not necessarily mean more installs. An app with both high installs and a high rating, however, can be considered a good app.

Visualization for Rating vs Installs

# Scatter plot of log-transformed Installs vs. Rating
ggplot(data_final, aes(x = log(Installs) , y = Rating)) +
  geom_point(color = "blue", alpha = 0.6) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +  # Add a regression line
  labs(title = "Log-Transformed Installs vs. Rating", 
       x = "log(Installs)", 
       y = "Rating") +
  theme_minimal()

Visualization for Rating vs Installs by Category

Visualization for Category Distribution

category_counts <- table(data_final$Category)

# Convert to data frame for plotting
category_counts_df <- as.data.frame(category_counts)
colnames(category_counts_df) <- c("Category", "Frequency") 

ggplot(category_counts_df, aes(x = reorder(Category, Frequency), y = Frequency)) + 
  geom_bar(stat = "identity", fill = "#1f3374") +
  geom_text(aes(label = Frequency), vjust = 0.5, hjust=1, size=2.5, color='#f8c220') +
  coord_flip() +  
  labs(title = "Distribution of Categories", x = "Category", y = "Frequency") +
  theme_minimal() +
   theme(
    plot.background = element_rect(fill = "#efefef", color = NA),
    panel.background = element_rect(fill = "#efefef", color = NA),
    axis.text.y = element_text(size = 5.5)
  )

As the graph above shows, most apps in the dataset belong to the Family, Game, and Tools categories, while Beauty and Comics have the fewest apps.

Visualization for Category vs. Installs

Below is a boxplot showing the distribution of the number of installs for each category, ordered by mean from highest to lowest.

ggplot(data_final, aes(x = reorder(Category, log(Installs), FUN = mean), y = Installs)) +
  geom_boxplot(outlier.color = "#f05555", outlier.shape = 1, color='#1f3374', fill="#efefef") +  # Red outliers for emphasis
  coord_flip() +  # Flip for better readability
  scale_y_log10() +
  theme_minimal() +
  labs(title = "Distribution of Installs by Category",
       x = "Category",
       y = "Number of Installs (Log Scale)") +
    theme(
    plot.background = element_rect(fill = "#efefef", color = NA),
    panel.background = element_rect(fill = "#efefef", color = NA),
    axis.text.y = element_text(size = 5.5)
  )

It can be seen from the graph that, on average, Entertainment apps receive the highest number of installations, followed by Education, Game, Photography, and Weather apps. In contrast, Art & Design apps have the fewest installations.

Visualization for Category vs. App Size

#convert_size <- function(size) {
#    size <- gsub(",", "", size)  # Remove commas
#    size <- tolower(size)  # Make lowercase for consistency
      
      # Handle "varies with device" by assigning NA
#    if (size == "varies with device") return(NA)
      
      # Convert "k" to MB (1 MB = 1000 KB)
 #   if (grepl("k", size)) return(as.numeric(gsub("k", "", size)) / 1000)
      
      # Convert "M" to numeric MB
  #  if (grepl("m", size)) return(as.numeric(gsub("m", "", size)))
      
      # Handle numeric values directly (e.g., "1000+")
   # if (grepl("\\d+\\+", size)) return(as.numeric(gsub("\\+", "", size)) / 1000)
      
      # Default case: return as numeric if possible
    #return(as.numeric(size))
    #}

Below is the figure showing the distribution of app sizes in each category.

#df_clean <- data_clean %>%
 # mutate(Size = sapply(Size, convert_size)) %>%
#  filter(!is.na(Size))

# Plot the histogram with faceting by category
ggplot(data_clean, aes(x = Size)) +
  geom_histogram(binwidth = 5, fill = "#304ba6", color = "black") +
  facet_wrap(~ Category, scales = "free_y") +
  theme_minimal() +
  labs(
    title = "Distribution of App Sizes by Category",
    x = "Size (MB)",
    y = "Count"
  ) +
  theme(
    strip.text = element_text(size = 5),
    axis.text.x = element_text(size = 7, angle = 45, hjust = 1)
  )

str(data_clean)
## 'data.frame':    9659 obs. of  13 variables:
##  $ App           : chr  "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "U Launcher Lite – FREE Live Cool Themes, Hide Apps" "Sketch - Draw & Paint" ...
##  $ Category      : chr  "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" ...
##  $ Rating        : num  4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
##  $ Reviews       : num  159 967 87510 215644 967 ...
##  $ Size          : num  19 14 8.7 25 2.8 5.6 19 29 33 3.1 ...
##  $ Installs      : num  1e+04 5e+05 5e+06 5e+07 1e+05 5e+04 5e+04 1e+06 1e+06 1e+04 ...
##  $ Type          : chr  "Free" "Free" "Free" "Free" ...
##  $ Price         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Content.Rating: chr  "Everyone" "Everyone" "Everyone" "Teen" ...
##  $ Genres        : chr  "Art & Design" "Art & Design;Pretend Play" "Art & Design" "Art & Design" ...
##  $ Last.Updated  : chr  "January 7, 2018" "January 15, 2018" "August 1, 2018" "June 8, 2018" ...
##  $ Current.Ver   : chr  "1.0.0" "2.0.0" "1.2.4" "Varies with device" ...
##  $ Android.Ver   : chr  "4.0.3 and up" "4.0.3 and up" "4.0.3 and up" "4.2 and up" ...
ggplot(data_clean, aes(x = reorder(Category, Size, FUN = median), y = Size)) + 
  geom_boxplot(outlier.color = "#f05555", outlier.shape = 1) + 
  coord_flip() + 
  theme_minimal() + 
  labs(
    title = "Boxplot of App Sizes by Category (Ordered by Median)", 
    x = "Category", 
    y = "Size (MB)"
  ) + 
  theme(
    strip.text = element_text(size = 8), 
    axis.text.x = element_text(size = 7, angle = 45, hjust = 1)
  )

As it can be seen from the two figures above, most categories exhibit right-skewed app sizes, with the majority being under 50MB. However, the Game category stands out with a significantly larger median app size compared to other categories.
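
The ordering used in the boxplot can also be read off as a table of medians. The sketch below does this with tapply() on a hypothetical toy data frame; the sizes are made up and only illustrate the computation.

```r
# Toy data (illustrative sizes in MB, not the project data)
toy_apps <- data.frame(
  Category = c("GAME", "GAME", "GAME", "TOOLS", "TOOLS", "TOOLS"),
  Size     = c(40, 55, 90, 3, 5, 25)
)

# Median size per category, sorted from largest to smallest
median_size <- sort(
  tapply(toy_apps$Size, toy_apps$Category, median),
  decreasing = TRUE
)
print(median_size)
```

The median is used rather than the mean because right-skewed sizes would pull the mean toward the few very large apps.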

Visualization for Category vs. Reviews

Below is the graph displaying the distribution of reviews left by users for each category.

df_aggregated <- data_final %>% 
  group_by(Category) %>% 
  summarise(Total_Reviews = sum(Reviews, na.rm = TRUE))

#df_aggregated
# Plot the total reviews by category using a bar chart
ggplot(df_aggregated, aes(x = reorder(Category, -Total_Reviews), y = log10(Total_Reviews))) + 
  geom_bar(stat = "identity", fill = "#1f3374") + 
  labs(
    title = "Log-Scaled Total Reviews by Category", 
    x = "Category", 
    y = "Log10(Total Number of Reviews)"
  ) + 
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

As can be seen, Game apps have the most reviews while Events apps have the fewest.

Histogram for Category vs. Rating

Below is the figure showing the distribution of ratings for each category.

ggplot(data_final, aes(x = Rating)) + 
  geom_histogram(binwidth = 0.5, fill = "#1f3374", color='#efefef') + 
  facet_wrap(~ Category, scales = "free_y") +  # Facet by Category with independent y-axis
  scale_x_continuous(limits = c(1, 5), breaks = seq(1, 5, by = 0.5)) +  # Restrict x-axis to 1-5
  theme_minimal() + 
  labs(
    title = "Distribution of Ratings by Category", 
    x = "Rating", 
    y = "Count"
  ) + 
  theme(
    strip.text = element_text(size = 5),  # Adjust facet label size
    axis.text.x = element_text(size = 5, angle = 45, hjust = 1),  # Rotate x-axis labels
    plot.title = element_text(hjust = 0.5)  # Center the plot title
  )

As illustrated in the graph above, in every category most app ratings fall between 4.0 and 5.0.

Visualization for Android Version

Below is the figure showing the distribution of Android versions.

extract_version <- function(version) {
  version <- tolower(as.character(version))  # Make lowercase for consistency
  
  # Handle missing values, "Varies with device" and "NaN"
  if (is.na(version) || version == "varies with device" || version == "nan") return(NA_real_)
  
  # Keep the leading major.minor part of the first version listed
  # (e.g., "4.1 - 7.1.1" -> "4.1", "4.0.3 and up" -> "4.0");
  # a three-part string such as "4.0.3" cannot be converted by as.numeric()
  first_version <- regmatches(version, regexpr("^[0-9]+(\\.[0-9]+)?", version))
  
  # Return NA when the string does not start with a version number
  if (length(first_version) == 0) return(NA_real_)
  
  return(as.numeric(first_version))  # Convert to numeric
}
df_clean <- data_final %>%
  mutate(Android_Ver = sapply(Android.Ver, extract_version)) %>%
  filter(!is.na(Android_Ver))  # Remove rows with NA in Android_Ver

android_installs <- data_final %>% 
  group_by(Android.Ver) %>% 
  summarize(Total_Installs = sum(Installs, na.rm = TRUE))
ggplot(df_clean, aes(x = Android_Ver)) + 
  geom_histogram(binwidth = 0.5, fill = "#1f3374", color='#efefef') + 
  scale_x_continuous(breaks = seq(1, 8, by = 1.0)) +  # Set x-axis ticks from 1.0 to 8.0
  theme_minimal() + 
  labs(
    title = "Distribution of Android Versions", 
    x = "Android Version", 
    y = "Count"
  ) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

As can be seen, the minimum required Android version for most apps is 4.0 and up.
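
Version strings with three parts, such as "4.0.3 and up", need some care, because as.numeric("4.0.3") returns NA. The sketch below is one vectorized way to keep only the leading major.minor number. The input vector is a hypothetical sample of the kinds of strings found in Android.Ver, and pattern and min_version are names introduced here for illustration.

```r
# Hypothetical examples of the strings found in Android.Ver
versions <- c("4.0.3 and up", "4.1 - 7.1.1", "2.3 and up",
              "Varies with device", "NaN")

# Capture the leading "major" or "major.minor" number
pattern <- "^([0-9]+(\\.[0-9]+)?).*$"
has_number <- grepl(pattern, versions)

# Start from NA and fill in only the strings that begin with a number
min_version <- rep(NA_real_, length(versions))
min_version[has_number] <- as.numeric(sub(pattern, "\\1", versions[has_number]))

print(data.frame(versions, min_version))
```

Because only the matching strings are passed to as.numeric(), no coercion warnings are raised for entries like "Varies with device".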

Bar plot for Android Version vs. Installs

Below is the graph showing the number of installs for each minimum required Android Version.

ggplot(data_final, aes(x = reorder(Android.Ver, Installs, FUN = sum), y = Installs)) + 
  geom_bar(stat = "identity", fill = "#1f3374") + 
  coord_flip() +  # Flip coordinates for better readability
  scale_y_continuous(labels = scales::comma) +  # Format y-axis with commas
  theme_minimal() + 
  labs(
    title = "Total Installs by Android Version", 
    x = "Android Version", 
    y = "Total Installs"
  ) + 
  theme(
    axis.text.y = element_text(size = 8),  # Adjust y-axis text size
    plot.title = element_text(hjust = 0.5)  # Center the plot title
  )

It can be seen that the highest total number of installs belongs to apps whose required Android version varies with device.

Boxplot for Android Version vs. Reviews

Below is the distribution of reviews for each minimum required Android Version.

df_clean <- data_final %>% 
  filter(!is.na(Android.Ver) & !is.na(Reviews)) %>% 
  mutate(Scaled_Reviews = log10(Reviews + 1))
ggplot(df_clean, aes(x = reorder(Android.Ver, Scaled_Reviews, FUN = median), y = Scaled_Reviews)) + 
  geom_boxplot(outlier.color = "#f05555", outlier.shape = 1) +  # Boxplot with red outliers
  coord_flip() +  # Flip coordinates for better readability
  theme_minimal() + 
  labs(
    title = "Distribution of Scaled Reviews by Android Version", 
    x = "Android Version", 
    y = "Scaled Reviews (Log10)"
  ) + 
  theme(
    axis.text.y = element_text(size = 8),  # Adjust y-axis text size
    plot.title = element_text(hjust = 0.5)  # Center the plot title
  )

It can be seen that apps requiring versions 4.1 to 7.1.1 have the highest number of reviews, while apps requiring versions 5.0 to 7.1.1 have the lowest.

Histogram for Android Version vs. Rating

Below is the plot showing the distribution of ratings for each Android version.

ggplot(df_clean, aes(x = Rating, fill = Android.Ver)) + 
  geom_histogram(binwidth = 0.5, position = "stack", color = "black", alpha = 0.7) + 
  scale_x_continuous(breaks = seq(1, 5, by = 0.5)) +  # Set x-axis breaks
  theme_minimal() + 
  labs(
    title = "Histogram of Ratings by Android Version", 
    x = "Rating", 
    y = "Count"
  ) + 
  theme(
    axis.text.x = element_text(size = 8), 
    axis.text.y = element_text(size = 8), 
    plot.title = element_text(hjust = 0.5)  # Center the plot title
  )

It can be seen that for most Android versions, ratings range between 4.0 and 5.0.

Distribution for Content.Rating

# Convert the Content.Rating column to a factor
data_final <- data_final %>%
  mutate(
    Content.Rating = as.factor(Content.Rating)
  )

# 1. Content Rating Distribution
content_rating_dist <- table(data_final$Content.Rating)
print("Content Rating Distribution:")
## [1] "Content Rating Distribution:"
print(content_rating_dist)
## 
## adults only 18+        everyone    everyone 10+      mature 17+            teen 
##               3            7903             322             393            1036 
##         unrated 
##               2

Visualization for Content Rating

# Bar plot for Content Rating
ggplot(data_final, aes(x = Content.Rating)) +
  geom_bar(fill = "skyblue") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
  labs(title = "Distribution of App Content Ratings",
       x = "Content Rating",
       y = "Number of Apps") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Everyone is the dominant content rating, covering 81.82% of all apps, while Adults only 18+ (about 0.03%) and Unrated (about 0.02%) are the smallest categories.
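
These shares follow directly from the counts in the Content Rating table printed earlier. A minimal sketch of the calculation is given below, with the counts typed in by hand rather than read from data_final.

```r
# Counts copied from the Content Rating distribution table
content_rating_counts <- c(
  "adults only 18+" = 3,
  "everyone"        = 7903,
  "everyone 10+"    = 322,
  "mature 17+"      = 393,
  "teen"            = 1036,
  "unrated"         = 2
)

# Share of each content rating, in percent
content_rating_pct <- round(100 * prop.table(content_rating_counts), 2)
print(content_rating_pct)
```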

# Last Updated Analysis
# Create summary of updates by month and year
updates_by_month <- data_final %>%
  mutate(
    update_month = format(Last.Updated, "%Y-%m"),
    update_year = year(Last.Updated)
  ) %>%
  group_by(update_month) %>%
  summarize(count = n()) %>%
  arrange(update_month)
# Plot updates over time
#ggplot(updates_by_month, aes(x = as.Date(paste0(update_month, "-01")), y = count)) +
  #geom_line(color = "blue") +
  #geom_point(color = "red") +
  #labs(title = "Number of App Updates Over Time",
  #     x = "Date",
  #     y = "Number of Updates") +
  #theme_minimal() +
 # theme(axis.text.x = element_text(angle = 45, hjust = 1))

The number of updates has increased sharply since the end of 2017.
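
The monthly counts behind this observation come from bucketing each Last.Updated date by year and month. The sketch below shows the same idea in base R on a handful of hypothetical dates; last_updated is a made-up vector, not the column from our data.

```r
# Hypothetical last-updated dates
last_updated <- as.Date(c("2017-11-03", "2017-11-20", "2018-07-01",
                          "2018-07-15", "2018-07-30", "2018-08-02"))

# Bucket each date by year-month and count apps per bucket
update_month <- format(last_updated, "%Y-%m")
updates_by_month <- as.data.frame(table(update_month), stringsAsFactors = FALSE)
names(updates_by_month) <- c("update_month", "count")
print(updates_by_month)
```

Because the "%Y-%m" labels sort alphabetically in date order, the resulting table is already chronological.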

# Content Rating and Update Frequency Relationship
update_frequency_by_rating <- data_final %>%
  group_by(Content.Rating) %>%
  summarize(
    avg_last_update = mean(Last.Updated),
    median_last_update = median(Last.Updated),
    n_apps = n()
  )
print("\nUpdate Frequency by Content Rating:")
## [1] "\nUpdate Frequency by Content Rating:"
print(update_frequency_by_rating)
## # A tibble: 6 × 4
##   Content.Rating  avg_last_update median_last_update n_apps
##   <fct>           <date>          <date>              <int>
## 1 adults only 18+ 2018-07-20      2018-07-24              3
## 2 everyone        2017-10-20      2018-04-20           7903
## 3 everyone 10+    2017-11-24      2018-06-06            322
## 4 mature 17+      2018-02-18      2018-07-09            393
## 5 teen            2017-12-03      2018-06-05           1036
## 6 unrated         2013-10-25      2013-10-25              2
# Content Rating Basic Analysis
#print("Basic Content Rating Analysis:")
#content_rating_counts <- table(data_final$Content.Rating)
#print(content_rating_counts)

# Basic bar plot for Content Rating
#ggplot(data_final, aes(x = Content.Rating)) +
#  geom_bar(fill = "skyblue") +
#  geom_text(stat = "count", aes(label = ..count..), vjust = -0.5) +
#  labs(title = "Distribution of App Content Ratings",
#       x = "Content Rating",
#       y = "Number of Apps") +
#   theme_minimal() +
#   theme(axis.text.x = element_text(angle = 45, hjust = 1))
# 
# # Calculate percentages
# content_rating_percentages <- prop.table(content_rating_counts) * 100
# print("\nContent Rating Percentages:")
# print(round(content_rating_percentages, 2))
# 
# # 1.2 Last Updated Basic Analysis
# data_final$Last.Updated <- as.Date(data_final$Last.Updated, format = "%B %d, %Y")
# 
# print("\nLast Updated Summary Statistics:")
# summary(data_final$Last.Updated)
# Time-based Analysis
data_final <- data_final %>%
  mutate(
    update_year = year(Last.Updated),
    update_month = month(Last.Updated),
    update_quarter = quarter(Last.Updated),
    days_since_update = as.numeric(difftime(max(Last.Updated), Last.Updated, units = "days"))
  )

# Monthly update pattern
monthly_updates <- data_final %>%
  group_by(update_year, update_month) %>%
  summarize(count = n()) %>%
  mutate(date = as.Date(paste(update_year, update_month, "01", sep = "-")))

ggplot(monthly_updates, aes(x = date, y = count)) +
  geom_line(color = "blue") +
  geom_point() +
  labs(title = "App Updates Over Time",
       x = "Date",
       y = "Number of Updates") +
  theme_minimal()

# 2.2 Content Rating Distribution by Update Quarter
ggplot(data_final, aes(x = factor(update_quarter), fill = Content.Rating)) +
  geom_bar(position = "dodge") +
  labs(title = "Content Rating Distribution by Quarter",
       x = "Quarter",
       y = "Count") +
  theme_minimal()

# 3.1 Update Frequency Analysis by Content Rating
update_patterns <- data_final %>%
  group_by(Content.Rating) %>%
  summarize(
    avg_days_since_update = mean(days_since_update),
    median_days_since_update = median(days_since_update),
    sd_days_since_update = sd(days_since_update),
    n_apps = n(),
    cv = sd(days_since_update) / mean(days_since_update) * 100  # Coefficient of Variation
  ) %>%
  arrange(avg_days_since_update)

print("\nUpdate Patterns by Content Rating:")
## [1] "\nUpdate Patterns by Content Rating:"
print(update_patterns)
## # A tibble: 6 × 6
##   Content.Rating  avg_days_since_update median_days_since_update
##   <fct>                           <dbl>                    <dbl>
## 1 adults only 18+                  18.3                      15 
## 2 mature 17+                      171.                       30 
## 3 teen                            248.                       64 
## 4 everyone 10+                    257.                       63 
## 5 everyone                        292.                      110 
## 6 unrated                        1748.                     1748.
## # ℹ 3 more variables: sd_days_since_update <dbl>, n_apps <int>, cv <dbl>
# 3.3 Advanced Visualization - Heatmap of Updates
update_heatmap_data <- data_final %>%
  group_by(update_month, Content.Rating) %>%
  summarize(count = n()) %>%
  spread(Content.Rating, count)

# Convert to matrix for heatmap
update_matrix <- as.matrix(update_heatmap_data[,-1])
rownames(update_matrix) <- month.abb[update_heatmap_data$update_month]

# Create heatmap
heatmap(update_matrix, 
        Colv = NA, 
        Rowv = NA,
        scale = "column",
        col = colorRampPalette(c("white", "steelblue"))(50),
        main = "Update Pattern Heatmap by Content Rating",
        xlab = "Content Rating",
        ylab = "Month")

# 3.4 Time Series Decomposition
# Focus on Everyone category as an example
#everyone_ts <- monthly_updates %>%
#  filter(count > 0) %>%
#  select(count) %>%
#  ts(frequency = 12)

#decomposed <- decompose(everyone_ts)
#plot(decomposed)

# 3.4 Update Velocity Analysis
update_velocity <- data_final %>%
  group_by(Content.Rating) %>%
  summarize(
    update_velocity = n() / n_distinct(update_month),
    total_apps = n()
  ) %>%
  arrange(desc(update_velocity))

print("\nUpdate Velocity by Content Rating:")
## [1] "\nUpdate Velocity by Content Rating:"
print(update_velocity)
## # A tibble: 6 × 3
##   Content.Rating  update_velocity total_apps
##   <fct>                     <dbl>      <int>
## 1 everyone                  659.        7903
## 2 teen                       86.3       1036
## 3 mature 17+                 32.8        393
## 4 everyone 10+               26.8        322
## 5 adults only 18+             1.5          3
## 6 unrated                     1            2

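Observation for Update Velocity Analysis: The update_velocity column is the number of apps in each content rating category divided by the number of distinct calendar months (1 to 12) in which those apps were last updated. Because the dataset records only each app's most recent update, this measures how many apps share a last-update month rather than how often an individual app is updated, and it grows with the size of the category.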
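To make the formula concrete, here is a minimal sketch on hypothetical toy data (the `toy` data frame and its values are invented for illustration). It applies the same `n() / n_distinct(update_month)` expression and shows that the result is driven by group size and by how many distinct months appear.

```r
library(dplyr)

# Hypothetical toy data: one row per app, with the calendar month of its last update
toy <- data.frame(
  Content.Rating = c(rep("everyone", 6), rep("teen", 2)),
  update_month   = c(1, 1, 1, 7, 7, 7, 3, 8)
)

toy_velocity <- toy %>%
  group_by(Content.Rating) %>%
  summarize(
    # apps in the group divided by the number of distinct calendar months
    update_velocity = n() / n_distinct(update_month),
    total_apps = n()
  )

print(toy_velocity)
```

Here "everyone" has six apps spread over two months and "teen" has two apps spread over two months, so the larger group gets the larger value even though no app in either group was updated more than once.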

# 1. Update Cycle Analysis
data_final <- data_final %>%
  mutate(
    Last.Updated = as.Date(Last.Updated, format = "%B %d, %Y"),
    day_of_week = wday(Last.Updated, label = TRUE),
    week_of_year = week(Last.Updated),
    month_of_year = month(Last.Updated, label = TRUE),
    season = case_when(
      month_of_year %in% c("Dec", "Jan", "Feb") ~ "Winter",
      month_of_year %in% c("Mar", "Apr", "May") ~ "Spring",
      month_of_year %in% c("Jun", "Jul", "Aug") ~ "Summer",
      TRUE ~ "Fall"
    )
  )

# Day of Week Update Pattern by Content Rating
dow_pattern <- data_final %>%
  group_by(Content.Rating, day_of_week) %>%
  summarise(count = n()) %>%
  group_by(Content.Rating) %>%
  mutate(percentage = count/sum(count) * 100)

ggplot(dow_pattern, aes(x = day_of_week, y = percentage, fill = Content.Rating)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~Content.Rating) +
  labs(title = "Update Day Preferences by Content Rating",
       x = "Day of Week",
       y = "Percentage of Updates") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

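# 2. Update Interval Analysis
# Note: the gaps below are between consecutive Last.Updated dates of different apps
# within a content rating, not between successive updates of the same app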
update_intervals <- data_final %>%
  group_by(Content.Rating) %>%
  arrange(Last.Updated) %>%
  mutate(days_between_updates = as.numeric(Last.Updated - lag(Last.Updated))) %>%
  summarise(
    mean_interval = mean(days_between_updates, na.rm = TRUE),
    median_interval = median(days_between_updates, na.rm = TRUE),
    std_dev = sd(days_between_updates, na.rm = TRUE),
    cv = std_dev / mean_interval * 100  # Coefficient of Variation
  )

print("Update Interval Analysis:")
## [1] "Update Interval Analysis:"
print(update_intervals)
## # A tibble: 6 × 5
##   Content.Rating  mean_interval median_interval std_dev    cv
##   <fct>                   <dbl>           <dbl>   <dbl> <dbl>
## 1 adults only 18+        15                  15    7.07  47.1
## 2 everyone                0.380               0    3.53 929. 
## 3 everyone 10+            8.33                1   46.5  557. 
## 4 mature 17+              5.48                0   21.5  392. 
## 5 teen                    2.36                0   14.7  622. 
## 6 unrated              1213                1213   NA     NA
# 3. Seasonal Update Intensity
seasonal_intensity <- data_final %>%
  group_by(Content.Rating, season) %>%
  summarise(
    update_count = n(),
    update_intensity = n() / n_distinct(Last.Updated)
  ) %>%
  arrange(Content.Rating, desc(update_intensity))

# Visualization of seasonal patterns
ggplot(seasonal_intensity, aes(x = season, y = update_intensity, fill = Content.Rating)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Seasonal Update Intensity by Content Rating",
       x = "Season",
       y = "Update Intensity") +
  theme_minimal()

# 4. Update Clustering Analysis
#update_features <- data_final %>%
#  group_by(Content.Rating) %>%
#  summarise(
#    mean_week = mean(week_of_year),
#    std_week = sd(week_of_year),
#    update_frequency = n(),
#    weekend_ratio = sum(day_of_week %in% c("Sat", "Sun")) / n()
#  )

# Normalize the features
#update_features_norm <- scale(update_features[,-1])
#rownames(update_features_norm) <- update_features$Content.Rating

# Perform hierarchical clustering
#update_clusters <- hclust(dist(update_features_norm))
#plot(update_clusters, main = "Hierarchical Clustering of Content Ratings by Update Patterns")
# 6. Update Consistency Score
#consistency_score <- data_final %>%
#  group_by(Content.Rating) %>%
#  summarise(
#    total_updates = n(),
#    unique_days = n_distinct(Last.Updated),
#   consistency_score = (total_updates / unique_days) * 
#      (1 - sd(as.numeric(day_of_week)) / 7)  # Normalized consistency metric
#  ) %>%
#  arrange(desc(consistency_score))

#print("\nUpdate Consistency Scores:")
#print(consistency_score)
# Convert Last.Updated to numeric (days since reference date) if not already done
# reference_date <- min(data_final$Last.Updated, na.rm = TRUE)  # Reference date
# data_final$Days.Since.Update <- as.numeric(data_final$Last.Updated - reference_date)
# 
# # Perform the Kolmogorov-Smirnov test on the numeric 'Days.Since.Update' values
# content_ratings <- unique(data_final$Content.Rating)
# ks_results <- data.frame(
#   rating1 = character(),
#   rating2 = character(),
#   p_value = numeric()
# )
# 
# for (i in 1:(length(content_ratings)-1)) {
#   for (j in (i+1):length(content_ratings)) {
#     # Extract groups, removing NA values
#     group1 <- na.omit(data_final$Days.Since.Update[data_final$Content.Rating == content_ratings[i]])
#     group2 <- na.omit(data_final$Days.Since.Update[data_final$Content.Rating == content_ratings[j]])
#     
#     # Check if both groups have enough data for comparison
#     if(length(group1) > 1 && length(group2) > 1) {
#       ks_test <- ks.test(group1, group2)
#       ks_results <- rbind(ks_results, 
#                           data.frame(rating1 = content_ratings[i],
#                                      rating2 = content_ratings[j],
#                                      p_value = ks_test$p.value))
#     }
#   }
# }
# 
# print("\nKolmogorov-Smirnov Test Results:")
# print(ks_results[ks_results$p_value < 0.05,])

Visualization for Content Rating vs Installs

# 1. Basic statistics for Installs by Content Rating
installs_by_rating <- data_final %>%
  group_by(Content.Rating) %>%
  summarise(
    mean_installs = mean(Installs, na.rm = TRUE),
    median_installs = median(Installs, na.rm = TRUE),
    total_installs = sum(Installs, na.rm = TRUE),
    n_apps = n()
  ) %>%
  arrange(desc(mean_installs))

print("Summary of Installs by Content Rating:")
## [1] "Summary of Installs by Content Rating:"
print(installs_by_rating)
## # A tibble: 6 × 5
##   Content.Rating  mean_installs median_installs total_installs n_apps
##   <fct>                   <dbl>           <dbl>          <dbl>  <int>
## 1 teen                15914358.          500000    16487275393   1036
## 2 everyone 10+        12472894.         1000000     4016271795    322
## 3 everyone             6602474.           50000    52179352961   7903
## 4 mature 17+           6203529.          500000     2437986878    393
## 5 adults only 18+       666667.          500000        2000000      3
## 6 unrated                25250            25250          50500      2
# 2. Visualize distribution of installs by content rating
ggplot(data_final, aes(x = Content.Rating, y = log10(Installs))) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Distribution of App Installs by Content Rating",
       x = "Content Rating",
       y = "Log10(Number of Installs)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Visualization for Last Updated vs Installs

data_analysis <- data_final %>%
  mutate(
    days_since_update = as.numeric(difftime(max(Last.Updated), Last.Updated, units = "days")),
    update_year = year(Last.Updated),
    update_month = month(Last.Updated)
  )


data_analysis <- data_analysis %>%
  mutate(update_recency = ifelse(days_since_update <= median(days_since_update),
                                "Recent Update", "Old Update"))

recent_vs_old <- data_analysis %>%
  group_by(Content.Rating, update_recency) %>%
  summarise(
    mean_installs = mean(Installs, na.rm = TRUE),
    median_installs = median(Installs, na.rm = TRUE),
    n_apps = n()
  )

print("\nComparison of Installs by Update Recency and Content Rating:")
## [1] "\nComparison of Installs by Update Recency and Content Rating:"
print(recent_vs_old)
## # A tibble: 10 × 5
## # Groups:   Content.Rating [6]
##    Content.Rating  update_recency mean_installs median_installs n_apps
##    <fct>           <chr>                  <dbl>           <dbl>  <int>
##  1 adults only 18+ Recent Update        666667.          500000      3
##  2 everyone        Old Update          1787608.           10000   4110
##  3 everyone        Recent Update      11819742.          500000   3793
##  4 everyone 10+    Old Update          2711120.          100000    135
##  5 everyone 10+    Recent Update      19520163.         1000000    187
##  6 mature 17+      Old Update           875646.          100000    118
##  7 mature 17+      Recent Update       8489675.          500000    275
##  8 teen            Old Update          1625562.           50000    441
##  9 teen            Recent Update      26504878.         1000000    595
## 10 unrated         Old Update            25250            25250      2
# 7. Visualization of update recency effect
ggplot(data_analysis, aes(x = Content.Rating, y = log10(Installs), fill = update_recency)) +
  geom_boxplot() +
  labs(title = "Install Distribution by Content Rating and Update Recency",
       x = "Content Rating",
       y = "Log10(Number of Installs)",
       fill = "Update Recency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Visualization for Last Updated vs Content Rating vs Installs

# 3. Timeline analysis: Average installs over time by content rating
installs_timeline <- data_final %>%
  group_by(Content.Rating, Last.Updated) %>%
  summarise(avg_installs = mean(Installs, na.rm = TRUE)) %>%
  ungroup()

ggplot(installs_timeline, aes(x = Last.Updated, y = log10(avg_installs), color = Content.Rating)) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Average App Installs Over Time by Content Rating",
       x = "Last Updated Date",
       y = "Log10(Average Installs)") +
  theme_minimal() +
  theme(legend.position = "bottom")

Statistical Tests

Statistical test for Installs and Price

# Check for missing values and ensure no negative/zero values in log_Installs
#data_final <- data_final %>%
  #filter(!is.na(Installs), Installs > 0)  # Remove missing values and zeros in Installs

# Apply log transformation, adding 1 to avoid log(0)
#data_final$log_Installs <- log(data_final$Installs + 1)

# Ensure Price_Category has no missing values
#data_final <- data_final %>%
 #filter(!is.na(Price_Category))

#Perform t-test on log-transformed Installs by Price Category
#t_test_result <- t.test(log_Installs ~ Price_Category, data = data_final, var.equal = FALSE)

#Print t-test results
#print(t_test_result)

There is a statistically significant difference between the number of installs for “Free” and “Paid” apps: the p-value of the t-test on log-transformed installs is extremely small.

In practical terms, free apps are installed more often than paid apps, which matches what is generally observed in the app market.
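The t-test code above is commented out, so the following is only a sketch of the approach those comments describe (log-transform installs, then run a Welch t-test by price category). It runs on simulated data, not on the Play Store dataset; the `toy` data frame and the log-normal parameters are assumptions made for illustration.

```r
set.seed(42)  # reproducible simulated data, not the Play Store dataset

# Hypothetical install counts: heavily right-skewed, as install counts usually are
toy <- data.frame(
  Price_Category = rep(c("Free", "Paid"), times = c(500, 100)),
  Installs = c(round(rlnorm(500, meanlog = 11, sdlog = 3)),
               round(rlnorm(100, meanlog = 8,  sdlog = 3)))
)

# log(x + 1) keeps apps with zero installs and reduces the skew
toy$log_Installs <- log1p(toy$Installs)

# Welch two-sample t-test (unequal variances) on the log scale
t_test_toy <- t.test(log_Installs ~ Price_Category, data = toy, var.equal = FALSE)
print(t_test_toy)

# Difference in log means, back-transformed to a Free/Paid ratio of geometric means
ratio_free_to_paid <- exp(unname(t_test_toy$estimate[1] - t_test_toy$estimate[2]))
ratio_free_to_paid
```

Working on the log scale means the test compares typical (geometric mean) installs rather than arithmetic means, which a few very large apps would otherwise dominate.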

T-Test for Reviews and Price

#Confirming with a t-test
# Perform t-test for Reviews between Free and Paid
t_test_reviews <- t.test(Reviews ~ Price_Category, data = data_final)

# Perform t-test for Rating between Free and Paid
t_test_rating <- t.test(Rating ~ Price_Category, data = data_final)

# Print the results
print(t_test_reviews)
## 
##  Welch Two Sample t-test
## 
## data:  Reviews by Price_Category
## t = 11.019, df = 9299.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Free and group Paid is not equal to 0
## 95 percent confidence interval:
##  185401.3 265636.3
## sample estimates:
## mean in group Free mean in group Paid 
##         234243.689           8724.888
print(t_test_rating)
## 
##  Welch Two Sample t-test
## 
## data:  Rating by Price_Category
## t = -3.9443, df = 883.57, p-value = 8.638e-05
## alternative hypothesis: true difference in means between group Free and group Paid is not equal to 0
## 95 percent confidence interval:
##  -0.1121028 -0.0376075
## sample estimates:
## mean in group Free mean in group Paid 
##           4.167384           4.242239
  • There is a statistically significant difference between the mean number of reviews for Free and Paid apps. Free apps have significantly more reviews on average.

  • There is a statistically significant difference between the mean ratings for Free and Paid apps. Paid apps have slightly higher ratings on average, though the difference is small.

ANOVA Test for Reviews vs Ratings

The tests below check whether different review categories have different average ratings.

anova_result <- aov(Rating ~ as.factor(Review_Category), data = data_clean)
summary(anova_result)
##                              Df Sum Sq Mean Sq F value Pr(>F)    
## as.factor(Review_Category)   11  106.3   9.662   41.36 <2e-16 ***
## Residuals                  9647 2253.6   0.234                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is far below 0.05 (p < 2e-16), so we reject the hypothesis that the average rating is the same across all review categories: at least one category differs.

Post Hoc Test

# Perform Tukey's HSD
tukey_result <- TukeyHSD(anova_result)
tukey_result
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Rating ~ as.factor(Review_Category), data = data_clean)
## 
## $`as.factor(Review_Category)`
##                     diff          lwr         upr     p adj
## 100+-0+     -0.096683215 -0.152307271 -0.04105916 0.0000009
## 500+-0+     -0.063032835 -0.141474646  0.01540898 0.2646281
## 1K+-0+      -0.019190832 -0.089971134  0.05158947 0.9992526
## 2.5K+-0+     0.003350463 -0.074143085  0.08084401 1.0000000
## 5K+-0+       0.064918154 -0.012646893  0.14248320 0.2087515
## 10K+-0+      0.095614797  0.030638525  0.16059107 0.0000973
## 25K+-0+      0.105627098  0.035846939  0.17540726 0.0000488
## 50K+-0+      0.167554014  0.091642554  0.24346547 0.0000000
## 100K+-0+     0.203608898  0.135724795  0.27149300 0.0000000
## 300K+-0+     0.249388670  0.170111342  0.32866600 0.0000000
## 1M+-0+       0.300139945  0.211244127  0.38903576 0.0000000
## 500+-100+    0.033650380 -0.054364565  0.12166533 0.9848292
## 1K+-100+     0.077492383 -0.003768703  0.15875347 0.0784345
## 2.5K+-100+   0.100033678  0.012862795  0.18720456 0.0096675
## 5K+-100+     0.161601369  0.074366918  0.24883582 0.0000001
## 10K+-100+    0.192298012  0.116039053  0.26855697 0.0000000
## 25K+-100+    0.202310313  0.121918874  0.28270175 0.0000000
## 50K+-100+    0.264237229  0.178469737  0.35000472 0.0000000
## 100K+-100+   0.300292113  0.221540831  0.37904339 0.0000000
## 300K+-100+   0.346071885  0.257311491  0.43483228 0.0000000
## 1M+-100+     0.396823160  0.299375844  0.49427048 0.0000000
## 1K+-500+     0.043842003 -0.054455739  0.14213974 0.9515761
## 2.5K+-500+   0.066383298 -0.036853541  0.16962014 0.6214468
## 5K+-500+     0.127950989  0.024660470  0.23124151 0.0030189
## 10K+-500+    0.158647632  0.064443010  0.25285225 0.0000025
## 25K+-500+    0.168659933  0.071079887  0.26623998 0.0000011
## 50K+-500+    0.230586849  0.128532233  0.33264146 0.0000000
## 100K+-500+   0.266641733  0.170408442  0.36287502 0.0000000
## 300K+-500+   0.312421505  0.207839051  0.41700396 0.0000000
## 1M+-500+     0.363172780  0.251123410  0.47522215 0.0000000
## 2.5K+-1K+    0.022541295 -0.075001405  0.12008400 0.9998394
## 5K+-1K+      0.084108986 -0.013490527  0.18170850 0.1727899
## 10K+-1K+     0.114805629  0.026878134  0.20273312 0.0012014
## 25K+-1K+     0.124817930  0.033283243  0.21635262 0.0005180
## 50K+-1K+     0.186744846  0.090454254  0.28303544 0.0000000
## 100K+-1K+    0.222799730  0.132702117  0.31289734 0.0000000
## 300K+-1K+    0.268579502  0.169613735  0.36754527 0.0000000
## 1M+-1K+      0.319330777  0.212504774  0.42615678 0.0000000
## 5K+-2.5K+    0.061567691 -0.041004546  0.16413993 0.7193424
## 10K+-2.5K+   0.092264334 -0.001152170  0.18568084 0.0565429
## 25K+-2.5K+   0.102276635  0.005457227  0.19909604 0.0276896
## 50K+-2.5K+   0.164203551  0.062875978  0.26553112 0.0000078
## 100K+-2.5K+  0.200258435  0.104796512  0.29572036 0.0000000
## 300K+-2.5K+  0.246038206  0.142165102  0.34991131 0.0000000
## 1M+-2.5K+    0.296789482  0.185401898  0.40817707 0.0000000
## 10K+-5K+     0.030696643 -0.062779181  0.12417247 0.9957463
## 25K+-5K+     0.040708944 -0.056167701  0.13758559 0.9685508
## 50K+-5K+     0.102635860  0.001253596  0.20401812 0.0440982
## 100K+-5K+    0.138690744  0.043170771  0.23421072 0.0001331
## 300K+-5K+    0.184470516  0.080544059  0.28839697 0.0000004
## 1M+-5K+      0.235221791  0.123784453  0.34665913 0.0000000
## 25K+-10K+    0.010012302 -0.077112114  0.09713672 0.9999999
## 50K+-10K+    0.071939217 -0.020169104  0.16404754 0.3070668
## 100K+-10K+   0.107994101  0.022380758  0.19360745 0.0022235
## 300K+-10K+   0.153773873  0.058872409  0.24867534 0.0000078
## 1M+-10K+     0.204525148  0.101453039  0.30759726 0.0000000
## 50K+-25K+    0.061926916 -0.033630908  0.15748474 0.6094814
## 100K+-25K+   0.097981800  0.008667751  0.18729585 0.0175649
## 300K+-25K+   0.143761571  0.045508620  0.24201452 0.0001113
## 1M+-25K+     0.194512847  0.088346871  0.30067882 0.0000001
## 100K+-50K+   0.036054884 -0.058127272  0.13023704 0.9846717
## 300K+-50K+   0.081834656 -0.020863551  0.18453286 0.2768896
## 1M+-50K+     0.132585931  0.022293168  0.24287869 0.0048805
## 300K+-100K+  0.045779772 -0.051135776  0.14269532 0.9282456
## 1M+-100K+    0.096531047 -0.008398431  0.20146052 0.1064662
## 1M+-300K+    0.050751275 -0.061884591  0.16338714 0.9479902
# Convert the result to a data frame
tukey_df <- as.data.frame(tukey_result$`as.factor(Review_Category)`)

# Filter for significant p-values
significant_tukey <- tukey_df[tukey_df[["p adj"]] < 0.05, ]

# Display the significant results
print(significant_tukey)
##                    diff          lwr         upr        p adj
## 100+-0+     -0.09668322 -0.152307271 -0.04105916 8.987756e-07
## 10K+-0+      0.09561480  0.030638525  0.16059107 9.732720e-05
## 25K+-0+      0.10562710  0.035846939  0.17540726 4.884843e-05
## 50K+-0+      0.16755401  0.091642554  0.24346547 0.000000e+00
## 100K+-0+     0.20360890  0.135724795  0.27149300 0.000000e+00
## 300K+-0+     0.24938867  0.170111342  0.32866600 0.000000e+00
## 1M+-0+       0.30013994  0.211244127  0.38903576 0.000000e+00
## 2.5K+-100+   0.10003368  0.012862795  0.18720456 9.667490e-03
## 5K+-100+     0.16160137  0.074366918  0.24883582 9.538328e-08
## 10K+-100+    0.19229801  0.116039053  0.26855697 0.000000e+00
## 25K+-100+    0.20231031  0.121918874  0.28270175 0.000000e+00
## 50K+-100+    0.26423723  0.178469737  0.35000472 0.000000e+00
## 100K+-100+   0.30029211  0.221540831  0.37904339 0.000000e+00
## 300K+-100+   0.34607188  0.257311491  0.43483228 0.000000e+00
## 1M+-100+     0.39682316  0.299375844  0.49427048 0.000000e+00
## 5K+-500+     0.12795099  0.024660470  0.23124151 3.018884e-03
## 10K+-500+    0.15864763  0.064443010  0.25285225 2.473396e-06
## 25K+-500+    0.16865993  0.071079887  0.26623998 1.080775e-06
## 50K+-500+    0.23058685  0.128532233  0.33264146 0.000000e+00
## 100K+-500+   0.26664173  0.170408442  0.36287502 0.000000e+00
## 300K+-500+   0.31242150  0.207839051  0.41700396 0.000000e+00
## 1M+-500+     0.36317278  0.251123410  0.47522215 0.000000e+00
## 10K+-1K+     0.11480563  0.026878134  0.20273312 1.201416e-03
## 25K+-1K+     0.12481793  0.033283243  0.21635262 5.179950e-04
## 50K+-1K+     0.18674485  0.090454254  0.28303544 1.572425e-08
## 100K+-1K+    0.22279973  0.132702117  0.31289734 0.000000e+00
## 300K+-1K+    0.26857950  0.169613735  0.36754527 0.000000e+00
## 1M+-1K+      0.31933078  0.212504774  0.42615678 0.000000e+00
## 25K+-2.5K+   0.10227664  0.005457227  0.19909604 2.768961e-02
## 50K+-2.5K+   0.16420355  0.062875978  0.26553112 7.808701e-06
## 100K+-2.5K+  0.20025843  0.104796512  0.29572036 3.507881e-10
## 300K+-2.5K+  0.24603821  0.142165102  0.34991131 0.000000e+00
## 1M+-2.5K+    0.29678948  0.185401898  0.40817707 0.000000e+00
## 50K+-5K+     0.10263586  0.001253596  0.20401812 4.409823e-02
## 100K+-5K+    0.13869074  0.043170771  0.23421072 1.331239e-04
## 300K+-5K+    0.18447052  0.080544059  0.28839697 4.428778e-07
## 1M+-5K+      0.23522179  0.123784453  0.34665913 2.244944e-10
## 100K+-10K+   0.10799410  0.022380758  0.19360745 2.223466e-03
## 300K+-10K+   0.15377387  0.058872409  0.24867534 7.832139e-06
## 1M+-10K+     0.20452515  0.101453039  0.30759726 5.942656e-09
## 100K+-25K+   0.09798180  0.008667751  0.18729585 1.756493e-02
## 300K+-25K+   0.14376157  0.045508620  0.24201452 1.113055e-04
## 1M+-25K+     0.19451285  0.088346871  0.30067882 1.436204e-07
## 1M+-50K+     0.13258593  0.022293168  0.24287869 4.880458e-03

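As we can see, 44 of the 66 pairwise comparisons are significant at the 0.05 level. The largest differences in average rating are between the low-review categories and 1M+ (for example, 0.40 for 1M+ vs 100+ and 0.30 for 1M+ vs 0+), as expected, while most neighbouring categories do not differ significantly.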

To make the comparison of Ratings and Reviews against Installs easier, we can group Installs into categories.
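One way such a grouping could be done is with base R's `cut()`. This is a sketch only: the install counts, break points, and labels below are hypothetical and are not the categories used in this report.

```r
# Hypothetical install counts; the break points and labels are illustrative only
installs <- c(5, 120, 3500, 50000, 1200000, 75000000)

install_category <- cut(
  installs,
  breaks = c(0, 1e3, 1e5, 1e7, Inf),
  labels = c("<1K", "1K-100K", "100K-10M", "10M+"),
  right = FALSE  # intervals are [lower, upper)
)

# Number of apps per install category
table(install_category)
```

The result is an ordered set of factor levels, so it can be passed directly to `group_by()` or used as the grouping variable in `aov()`.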

ANOVA test for Content Rating vs Installs

# 1. Encode content rating (e.g., as factor levels or one-hot encoding)
data_final$Content.Rating <- as.factor(data_final$Content.Rating)

data_final <- data_final %>%
  filter(!is.na(Installs) & Installs > 0)

# ANOVA test for difference in installs between content ratings
install_anova <- aov(log10(Installs) ~ Content.Rating, data = data_final)

print("\nANOVA test results for Installs by Content Rating:")
## [1] "\nANOVA test results for Installs by Content Rating:"
print(summary(install_anova))
##                  Df Sum Sq Mean Sq F value Pr(>F)    
## Content.Rating    5    743  148.68   41.95 <2e-16 ***
## Residuals      9638  34160    3.54                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA analysis: The test revealed significant differences in log10 install counts across content ratings (F(5, 9638) = 41.95, p < 2e-16). Content rating is therefore associated with the number of installs, although it explains only about 2% of the variance in log10(Installs) (743 / (743 + 34160) ≈ 0.021), so the practical effect is small.
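As a quick check of the effect size, eta squared can be computed directly from the sums of squares printed in the ANOVA table above. The variable names here are chosen for illustration.

```r
# Sums of squares copied from the ANOVA table above
ss_between <- 743     # Content.Rating
ss_within  <- 34160   # Residuals

# Eta squared: share of total variance in log10(Installs) explained by content rating
eta_squared <- ss_between / (ss_between + ss_within)
round(eta_squared, 4)
```

A very small p-value combined with a small eta squared is common with close to 10,000 observations: the test can detect group differences that are real but modest.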

Correlation

Correlation for all variables in data_final

Let's convert all the categorical variables into factors and then into a numeric data frame for calculating the correlation matrix.

# Step 1: Create a copy of the original data without specific columns
columns_to_remove <- c("App", "Scaled_Reviews", "update_year", "update_month", 
                        "update_quarter", "days_since_update", "week_of_year", "Last.Updated","day_of_week","month_of_year","season")

data_numeric_or_factor <- data_final %>%
  select(-one_of(columns_to_remove))



# Step 2: Convert specified categorical columns to factors

data_factor <- data_numeric_or_factor

# Step 3: Identify categorical columns
categorical_columns <- sapply(data_numeric_or_factor, is.factor)

# Step 4: Convert each categorical variable to numeric
data_final_numeric <- data_numeric_or_factor  # Copy of the data
data_final_numeric[categorical_columns] <- lapply(data_numeric_or_factor[categorical_columns], 
                                                   function(x) as.numeric(as.factor(x)))


# Step 5: Calculate Pearson correlation
correlation_matrix <- cor(data_final_numeric,method = "pearson", use = "complete.obs")
print(correlation_matrix)
##                   Category      Rating      Reviews        Size     Installs
## Category        1.00000000 -0.03751629  0.017314782 -0.12554584  0.031686330
## Rating         -0.03751629  1.00000000  0.055012661  0.05628115  0.040069031
## Reviews         0.01731478  0.05501266  1.000000000  0.07551130  0.625154887
## Size           -0.12554584  0.05628115  0.075511296  1.00000000  0.040696285
## Installs        0.03168633  0.04006903  0.625154887  0.04069628  1.000000000
## Price          -0.01529234 -0.02104069 -0.007251784 -0.02144237 -0.008990597
## Content.Rating -0.09472403  0.02593420  0.055673725  0.18320876  0.049856457
## Android.Ver     0.09103984  0.05806899  0.106378527  0.07349807  0.158803620
##                       Price Content.Rating  Android.Ver
## Category       -0.015292339    -0.09472403  0.091039837
## Rating         -0.021040692     0.02593420  0.058068988
## Reviews        -0.007251784     0.05567372  0.106378527
## Size           -0.021442367     0.18320876  0.073498068
## Installs       -0.008990597     0.04985646  0.158803620
## Price           1.000000000    -0.01236215 -0.008206668
## Content.Rating -0.012362146     1.00000000 -0.003978120
## Android.Ver    -0.008206668    -0.00397812  1.000000000
# Calculate the Spearman correlation
correlation_matrix1 <- cor(data_final_numeric, method = "spearman", use = "complete.obs")
print(correlation_matrix1)
##                   Category       Rating     Reviews        Size    Installs
## Category        1.00000000 -0.023119863  0.05869707 -0.11488085  0.06635885
## Rating         -0.02311986  1.000000000  0.20010260  0.07360797  0.11910109
## Reviews         0.05869707  0.200102595  1.00000000  0.33103200  0.96758410
## Size           -0.11488085  0.073607965  0.33103200  1.00000000  0.31015168
## Installs        0.06635885  0.119101094  0.96758410  0.31015168  1.00000000
## Price           0.01216126  0.055577543 -0.14623948 -0.04307736 -0.22792137
## Content.Rating -0.10880677  0.006133931  0.16571383  0.19614059  0.13996576
## Android.Ver     0.08998890  0.079906580  0.19199325  0.24650879  0.19548167
##                      Price Content.Rating  Android.Ver
## Category        0.01216126   -0.108806771  0.089988903
## Rating          0.05557754    0.006133931  0.079906580
## Reviews        -0.14623948    0.165713834  0.191993251
## Size           -0.04307736    0.196140589  0.246508786
## Installs       -0.22792137    0.139965764  0.195481669
## Price           1.00000000   -0.036667323 -0.098542484
## Content.Rating -0.03666732    1.000000000 -0.006319451
## Android.Ver    -0.09854248   -0.006319451  1.000000000
# Step 6: Plot the correlation matrix
corrplot(correlation_matrix, method = "color", addCoef.col = "black")

corrplot(correlation_matrix1, method = "color", addCoef.col = "black")

As seen, Installs has the highest correlation with Reviews (0.63 Pearson, 0.97 Spearman).

The Pearson and Spearman correlation matrices and plots differ noticeably. For the categorical variables, which enter the matrix as integer codes, we refer to the Spearman correlations.
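One caveat when reading the coefficients for categorical variables: `as.numeric(as.factor(x))` assigns integer codes in level order (alphabetical by default), so for a nominal variable such as Category the coefficient depends on that arbitrary ordering. The sketch below uses hypothetical toy values to show the same data giving different rank correlations under two level orders.

```r
# Hypothetical toy data: a nominal category and a numeric outcome
category <- c("GAME", "TOOLS", "FAMILY", "GAME", "TOOLS", "FAMILY")
installs <- c(100, 10, 50, 120, 20, 60)

# Default factor levels are alphabetical: FAMILY = 1, GAME = 2, TOOLS = 3
codes_alpha <- as.numeric(factor(category))

# Same data, different (equally arbitrary) level order: TOOLS = 1, FAMILY = 2, GAME = 3
codes_other <- as.numeric(factor(category, levels = c("TOOLS", "FAMILY", "GAME")))

cor_alpha <- cor(codes_alpha, installs, method = "spearman")
cor_other <- cor(codes_other, installs, method = "spearman")

# The coefficient changes sign purely because of the level order
c(alphabetical = cor_alpha, reordered = cor_other)
```

For ordered variables such as Content.Rating or Android.Ver the codes are more meaningful, provided the factor levels are in their natural order.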

Correlation Reviews

reviews_correlation_factor <- correlation_matrix[, "Reviews", drop = FALSE]

reviews_correlation_factor1 <- correlation_matrix1[, "Reviews", drop = FALSE]

# Print the correlation matrix for Reviews from numeric factor data
print(reviews_correlation_factor)
##                     Reviews
## Category        0.017314782
## Rating          0.055012661
## Reviews         1.000000000
## Size            0.075511296
## Installs        0.625154887
## Price          -0.007251784
## Content.Rating  0.055673725
## Android.Ver     0.106378527
# Step 6: Create correlation plots for Reviews (Pearson, then Spearman)
corrplot(reviews_correlation_factor, method = "color", addCoef.col = "black", 
         title = "Correlation of Reviews with Other Variables (Pearson)", 
         tl.col = "black", tl.srt = 45)

corrplot(reviews_correlation_factor1, method = "color", addCoef.col = "black", 
         title = "Correlation of Reviews with Other Variables (Spearman)", 
         tl.col = "black", tl.srt = 45)

As seen, Reviews has its highest positive correlation with Installs (0.63 Pearson, 0.97 Spearman). In the Spearman matrix it is also weakly positively correlated with Size (0.33), Rating (0.20), Android version (0.19), and content rating (0.17).
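The gap between the Pearson and Spearman coefficients for Reviews and Installs is what one would expect for a relationship that is monotonic but far from linear. A minimal sketch on hypothetical data (not the Play Store values) illustrates the effect:

```r
# Hypothetical data: y grows monotonically but very non-linearly with x
x <- 1:20
y <- exp(x)

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")

# Spearman works on ranks, so a strictly increasing relationship gives 1;
# Pearson measures linear association and is pulled down by the curvature
c(pearson = pearson, spearman = spearman)

# Rank correlation is unchanged by a log transform of either variable
same_after_log <- isTRUE(all.equal(spearman, cor(x, log10(y), method = "spearman")))
same_after_log
```

This is also why log-transforming Installs changes a Pearson coefficient but leaves a Spearman coefficient untouched.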

Correlation with Rating

rating_correlation_factor <- correlation_matrix[, "Rating", drop = FALSE]

rating_correlation_factor1 <- correlation_matrix1[, "Rating", drop = FALSE]

# Print the correlation matrix for Rating from numeric factor data
print(rating_correlation_factor)
##                     Rating
## Category       -0.03751629
## Rating          1.00000000
## Reviews         0.05501266
## Size            0.05628115
## Installs        0.04006903
## Price          -0.02104069
## Content.Rating  0.02593420
## Android.Ver     0.05806899
# Step 6: Create correlation plots for Rating (Pearson, then Spearman)
corrplot(rating_correlation_factor, method = "color", addCoef.col = "black", 
         title = "Correlation of Rating with Other Variables (Pearson)", 
         tl.col = "black", tl.srt = 45)

corrplot(rating_correlation_factor1, method = "color", addCoef.col = "black", 
         title = "Correlation of Rating with Other Variables (Spearman)", 
         tl.col = "black", tl.srt = 45)

Rating is only weakly correlated with every other variable: all coefficients in the matrix above are below 0.06 in absolute value. The slight positive correlations with Reviews (0.055) and Installs (0.040) match what the earlier visualisations showed.

Correlation with Price

# Spearman correlation for Price
price_correlation_factor1 <- correlation_matrix1[, "Price", drop = FALSE]
print("Spearman Correlation of Price with Other Variables:")
## [1] "Spearman Correlation of Price with Other Variables:"
print(price_correlation_factor1)
##                      Price
## Category        0.01216126
## Rating          0.05557754
## Reviews        -0.14623948
## Size           -0.04307736
## Installs       -0.22792137
## Price           1.00000000
## Content.Rating -0.03666732
## Android.Ver    -0.09854248
# Plot for Spearman correlation with Price
corrplot(price_correlation_factor1, method = "color", addCoef.col = "black", 
         title = "Correlation of Price with Other Variables (Spearman)", 
         tl.col = "black", tl.srt = 45)

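In the Spearman matrix above, Price is weakly negatively correlated with Installs (-0.23) and Reviews (-0.15). Price vs. Log_Installs: -0.06, which likewise suggests a very weak negative relationship between price and the number of installs.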
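
When comparing a coefficient computed on log-transformed installs with one computed on raw installs, it helps to remember that Spearman's coefficient depends only on ranks, so a monotonic transform such as `log10` leaves it unchanged, whereas Pearson's coefficient generally changes. A small sketch with synthetic data (the vectors `price` and `installs` are hypothetical and not taken from the dataset):

```r
set.seed(1)
installs <- round(10^runif(300, 1, 7))                 # strictly positive synthetic counts
price    <- pmax(0, 5 - log10(installs) + rnorm(300))  # loosely decreasing in installs

spearman_raw <- cor(price, installs,        method = "spearman")
spearman_log <- cor(price, log10(installs), method = "spearman")
pearson_raw  <- cor(price, installs,        method = "pearson")
pearson_log  <- cor(price, log10(installs), method = "pearson")

# Spearman is the same before and after the log transform; Pearson is not
print(c(spearman_raw = spearman_raw, spearman_log = spearman_log))
print(c(pearson_raw  = pearson_raw,  pearson_log  = pearson_log))
```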

Correlation between Time Analysis Variables and Installs

# Create a new data frame with relevant variables for correlation analysis
#correlation_data <- data_analysis %>%
#  select(days_since_update, update_year, update_month) %>%
#  mutate(log_installs = log10(data_final$Installs))

# Calculate the correlation matrix
#correlation_matrix <- cor(correlation_data, method = "spearman", use = "complete.obs")

# Print the correlation matrix
#print("Spearman Correlation Matrix:")
#corrplot(correlation_matrix, method = "color", 
#          col = colorRampPalette(c("red", "white", "blue"))(200),
#          type = "upper", 
#          tl.col = "black", tl.srt = 45, 
#          addCoef.col = "black", # Add correlation coefficients
#          number.cex = 0.7,      # Adjust size of numbers
#          title = "Correlation Matrix", # Title
#          mar = c(0, 0, 1, 0))   # Margins

Correlation Analysis: A moderate negative correlation (ρ = −0.3317) was found between the number of days since the last update and the log-transformed installs. This indicates that installs tend to decrease as the time since the last update increases. The relationship is statistically significant (p < 2.2e-16), suggesting that timely updates may be crucial for maintaining user engagement.
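
A coefficient and p-value of this kind can be obtained in a single call with `cor.test()`. The sketch below is illustrative only: the vectors `days_since_update` and `log_installs` are simulated here under an assumed negative relationship, so its numbers will not match the ones reported above.

```r
set.seed(123)
n <- 500
days_since_update <- round(runif(n, 0, 2000))
# Hypothetical relationship: apps updated longer ago tend to have fewer installs
log_installs <- 6 - 0.001 * days_since_update + rnorm(n, sd = 1)

# exact = FALSE avoids the warning about exact p-values when ties are present
update_test <- cor.test(days_since_update, log_installs,
                        method = "spearman", exact = FALSE)

print(update_test$estimate)  # Spearman's rho
print(update_test$p.value)   # p-value for H0: rho = 0
```

R prints any p-value smaller than about 2.2e-16 as `p-value < 2.2e-16`, which is the form quoted in the text.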

Chi-square Test for Content Rating vs Last Updated

# 3.2 Statistical Tests

# Chi-square test for independence
contingency_table <- table(data_final$Content.Rating, data_final$update_quarter)
chi_test <- chisq.test(contingency_table)
print("Chi-square test for independence between Content Rating and Update Quarter:")
## [1] "Chi-square test for independence between Content Rating and Update Quarter:"
print(chi_test)
## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 87.726, df = 15, p-value = 2.63e-12

The p-value (2.63e-12) is far below 0.05, indicating a statistically significant association between Content Rating and the quarter of the last update.
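
A significant chi-square result says that the two variables are not independent, but not which cells drive the association or whether the approximation is reliable. A minimal sketch of two follow-up checks, using the `expected` and `stdres` components returned by `chisq.test()`; the vectors `content_rating` and `update_quarter` are simulated here (independently, so no association is expected) and are not the report's data:

```r
set.seed(7)
n <- 1000
content_rating <- sample(c("Everyone", "Teen", "Mature 17+"), n,
                         replace = TRUE, prob = c(0.7, 0.2, 0.1))
update_quarter <- sample(paste0("Q", 1:4), n, replace = TRUE)

toy_table <- table(content_rating, update_quarter)
toy_test  <- chisq.test(toy_table)

# The chi-square approximation is commonly considered reliable
# when expected cell counts are not too small (a usual rule of thumb is >= 5)
print(min(toy_test$expected))

# Standardized residuals: cells with absolute values above roughly 2
# deviate most from what independence would predict
print(round(toy_test$stdres, 2))
```

If some expected counts are small, `chisq.test()` issues a warning, and sparse categories may need to be merged before the test is interpreted.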

Implications: These findings suggest that regular updates are important for sustaining app installs and that different content ratings can influence user engagement. Strategies aimed at timely updates and optimizing content ratings could enhance app performance and user acquisition.